Re: [PATCH net] net: ipmr, ip6mr: fix vif/tunnel failure race condition

2015-11-24 Thread David Miller
From: Nikolay Aleksandrov 
Date: Tue, 24 Nov 2015 17:09:30 +0100

> From: Nikolay Aleksandrov 
> 
> Since (at least) commit b17a7c179dd3 ("[NET]: Do sysfs registration as
> part of register_netdevice."), netdev_run_todo() deals only with
> unregistration, so we don't need to do the rtnl_unlock/lock cycle to
> finish registration when failing pimreg or dvmrp device creation. In
> fact that opens a race condition where someone can delete the device
> while rtnl is unlocked because it's fully registered. The problem gets
> worse when netlink support is introduced as there are more points of entry
> that can cause it and it also makes reusing that code correctly impossible.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
> I was able to crash the kernel by artificially triggering this race just to
confirm it. I know it's very unlikely to hit the race in the real world, but
> the biggest advantage of the change is that the code can be re-used later
> when adding netlink support.

Applied, thank you.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
David Miller  wrote:
> From: Florian Westphal 
> Date: Tue, 24 Nov 2015 23:22:42 +0100
> 
> > Yes, I get that point, but I maintain that KCM is a strange workaround
> > for bad userspace design.
> 
> I fundamentally disagree with you.

Fair enough.  Still, I do not see how what KCM intends to do
can be achieved while at the same time imposing some upper bound on
the amount of kernel memory we can allocate globally and per socket.

Once such a limit is enforced, the question becomes how the kernel
could handle hitting it other than by closing the underlying TCP
connection.

Not adding any limit is not a good idea in my opinion.


[PATCH] xprtrdma: add missing curly braces, set rc to zero on non-zero

2015-11-24 Thread Colin King
From: Colin Ian King 

Add the missing curly braces so that rc is only set to zero when
it is non-zero.  Without this minor fix, rc is set to zero even
when it is zero, which is slightly redundant.

Detected with smatch static analysis.

Signed-off-by: Colin Ian King 
---
 net/sunrpc/xprtrdma/verbs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index eadd1655..2cc1014 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -852,10 +852,11 @@ retry:
 
if (extras) {
rc = rpcrdma_ep_post_extra_recv(r_xprt, extras);
-   if (rc)
+   if (rc) {
pr_warn("%s: rpcrdma_ep_post_extra_recv: %i\n",
__func__, rc);
rc = 0;
+   }
}
}
 
-- 
2.6.2



[PATCH 1/1] net: add killer e2400 device id

2015-11-24 Thread Owen Lin
Add Killer E2400 device ID in alx driver.

Signed-off-by: Owen Lin o...@rivetnetworks.com



diff -up1rN alx_orig/main.c alx/main.c
--- alx_orig/main.c Wed Nov 25 08:01:49 2015
+++ alx/main.c  Wed Nov 25 08:05:20 2015
@@ -1539,2 +1539,3 @@ static const struct pci_device_id alx_pc
{ PCI_VDEVICE(ATTANSIC, ALX_DEV_ID_AR8171) },
+   { PCI_VDEVICE(ATTANSIC, ALX_DEV_ID_E2400) },
{ PCI_VDEVICE(ATTANSIC, ALX_DEV_ID_AR8172) },
diff -up1rN alx_orig/reg.h alx/reg.h
--- alx_orig/reg.h  Wed Nov 25 08:01:05 2015
+++ alx/reg.h   Wed Nov 25 08:03:39 2015
@@ -38,5 +38,6 @@
 #define ALX_DEV_ID_AR8161  0x1091
-#define ALX_DEV_ID_E2200   0xe091
+#define ALX_DEV_ID_E2200   0xE091
 #define ALX_DEV_ID_AR8162  0x1090
 #define ALX_DEV_ID_AR8171  0x10A1
+#define ALX_DEV_ID_E2400   0xE0A1
 #define ALX_DEV_ID_AR8172  0x10A0 

 /* rev definition,


Re: use-after-free in sock_wake_async

2015-11-24 Thread Benjamin LaHaise
On Tue, Nov 24, 2015 at 04:30:01PM -0500, Jason Baron wrote:
> So looking at this trace I think its the other->sk_socket that gets
> freed and then we call sk_wake_async() on it.
> 
> We could I think grab the socket reference there with unix_state_lock(),
> since that is held by unix_release_sock() before the final iput() is called.
> 
> So something like below might work (compile tested only):

That just adds the performance regression back in.  It should be possible 
to protect the other socket dereference using RCU.  I haven't had time to 
look at this yet today, but will try to find some time this evening to come 
up with a suggested patch.

-ben

> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index aaa0b58..2b014f1 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -196,6 +196,19 @@ static inline int unix_recvq_full(struct sock const *sk)
>   return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
>  }
> 
> +struct socket *unix_peer_get_socket(struct sock *s)
> +{
> + struct socket *peer;
> +
> + unix_state_lock(s);
> + peer = s->sk_socket;
> + if (peer)
> + __iget(SOCK_INODE(s->sk_socket));
> + unix_state_unlock(s);
> +
> + return peer;
> +}
> +
>  struct sock *unix_peer_get(struct sock *s)
>  {
>   struct sock *peer;
> @@ -1639,6 +1652,7 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>  {
>   struct sock *sk = sock->sk;
>   struct sock *other = NULL;
> + struct socket *other_socket = NULL;
>   int err, size;
>   struct sk_buff *skb;
>   int sent = 0;
> @@ -1662,7 +1676,10 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>   } else {
>   err = -ENOTCONN;
>   other = unix_peer(sk);
> - if (!other)
> + if (other)
> + other_socket = unix_peer_get_socket(other);
> +
> + if (!other_socket)
>   goto out_err;
>   }
> 
> @@ -1721,6 +1738,9 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>   sent += size;
>   }
> 
> + if (other_socket)
> + iput(SOCK_INODE(other_socket));
> +
>   scm_destroy(&scm);
> 
>   return sent;
> @@ -1733,6 +1753,8 @@ pipe_err:
>   send_sig(SIGPIPE, current, 0);
>   err = -EPIPE;
>  out_err:
> + if (other_socket)
> + iput(SOCK_INODE(other_socket));
>   scm_destroy(&scm);
>   return sent ? : err;
>  }

-- 
"Thought is the essence of where you are now."


Re: [PATCH] rxrpc: Correctly handle ack at end of client call transmit phase

2015-11-24 Thread David Miller
From: David Howells 
Date: Tue, 24 Nov 2015 14:41:59 +

> Normally, the transmit phase of a client call is implicitly ack'd by the
> reception of the first data packet of the response being received.
> However, if a security negotiation happens, the transmit phase, if it is
> entirely contained in a single packet, may get an ack packet in response
> and then may get aborted due to security negotiation failure.
> 
> Because the client has shifted state to RXRPC_CALL_CLIENT_AWAIT_REPLY due
> to having transmitted all the data, the code that handles processing of the
> received ack packet doesn't note the hard ack of the data packet.
> 
> The following abort packet in the case of security negotiation failure then
> incurs an assertion failure when it tries to drain the Tx queue because the
> hard ack state is out of sync (hard ack means the packets have been
> processed and can be discarded by the sender; a soft ack means that the
> packets are received but could still be discarded and rerequested by the
> receiver).
> 
> To fix this, we should record the hard ack we received for the ack packet.
> 
> The assertion failure looks like:
...
> Signed-off-by: David Howells 

Applied, thanks David.


Re: [PATCH net] bpf: fix clearing on persistent program array maps

2015-11-24 Thread David Miller
From: Daniel Borkmann 
Date: Tue, 24 Nov 2015 21:28:15 +0100

> Currently, when having map file descriptors pointing to program arrays,
> there's still the issue that we unconditionally flush program array
> contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
> when such a file descriptor is released and is independent of the map's
> refcount.
> 
> Having this flush independent of the refcount is for a reason: there
> can be arbitrary complex dependency chains among tail calls, also circular
> ones (direct or indirect, nesting limit determined during runtime), and
> we need to make sure that the map drops all references to eBPF programs
> it holds, so that the map's refcount can eventually drop to zero and
> initiate its freeing. Btw, a walk of the whole dependency graph would
> not be possible for various reasons, one being complexity and another
> one inconsistency, i.e. new programs can be added to parts of the graph
> at any time, so there's no guaranteed consistent state for the time of
> such a walk.
> 
> Now, the program array pinning itself works, but the issue is that each
> derived file descriptor on close would nevertheless call unconditionally
> into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
> this flush until the last reference to a user is dropped. As this only
> concerns a subset of references (f.e. a prog array could hold a program
> that itself has reference on the prog array holding it, etc), we need to
> track them separately.
> 
> Short analysis on the refcounting: on map creation time usercnt will be
> one, so there's no change in behaviour for bpf_map_release(), if unpinned.
> If we already fail in map_create(), we are immediately freed, and no
> file descriptor has been made public yet. In bpf_obj_pin_user(), we need
> to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
> reference, so before we drop the reference on the fd with fdput().
> Therefore, if actual pinning fails, we need to drop that reference again
> in bpf_any_put(), otherwise we keep holding it. When last reference
> drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
> care of dropping the usercnt again. In the bpf_obj_get_user() case, the
> bpf_any_get() will grab a reference on the usercnt, still at a time when
> we have the reference on the path. Should we later on fail to grab a new
> file descriptor, bpf_any_put() will drop it, otherwise we hold it until
> bpf_map_release() time.
> 
> Joint work with Alexei.
> 
> Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
> Signed-off-by: Daniel Borkmann 
> Signed-off-by: Alexei Starovoitov 

Applied, thanks a lot Daniel.


Re: Kernel 4.1.12 crash

2015-11-24 Thread Andrew

Hi.

I tried to reproduce the errors in a virtual environment (some VMs on my 
notebook).


I've tried to create 1000 client PPPoE sessions from this box via script:

for i in `seq 1 1000`; do pppd plugin rp-pppoe.so user test password test nodefaultroute maxfail 0 persist nodefaultroute holdoff 1 noauth eth0; done


And on VM that is used as client I've got strange random crashes (that 
are present only when server is online - so they're network-related):


http://postimg.org/image/ohr2mu3rj/ - crash is here:
(gdb) list *process_one_work+0x32
0xc10607b2 is in process_one_work 
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/workqueue.c:1952).

1947		__releases(&pool->lock)
1948		__acquires(&pool->lock)
1949	{
1950		struct pool_workqueue *pwq = get_work_pwq(work);
1951		struct worker_pool *pool = worker->pool;
1952		bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
1953		int work_color;
1954		struct worker *collision;
1955	#ifdef CONFIG_LOCKDEP
1956		/*


http://postimg.org/image/x9mychssx/ - crash is here (noticed twice):
0xc10658bf is in kthread_data 
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:136).

131	 * The caller is responsible for ensuring the validity of @task when
132	 * calling this function.
133	 */
134	void *kthread_data(struct task_struct *task)
135	{
136		return to_kthread(task)->data;
137	}

which is reached from a strange place:
(gdb) list *kthread_create_on_node+0x120
0xc1065340 is in kthread 
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:176).

171	{
172		__kthread_parkme(to_kthread(current));
173	}
174
175	static int kthread(void *_create)
176	{
177		/* Copy data: it's on kthread's stack */
178		struct kthread_create_info *create = _create;
179		int (*threadfn)(void *data) = create->threadfn;
180		void *data = create->data;

And earlier:
(gdb) list *ret_from_kernel_thread+0x21
0xc13bb181 is at 
/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/arch/x86/kernel/entry_32.S:312.

307		popl_cfi %eax
308		pushl_cfi $0x0202	# Reset kernel eflags
309		popfl_cfi
310		movl PT_EBP(%esp),%eax
311		call *PT_EBX(%esp)
312		movl $0,PT_EAX(%esp)
313		jmp syscall_exit
314		CFI_ENDPROC
315	ENDPROC(ret_from_kernel_thread)
316
316

Stack corruption?..

I'll try to set up a test environment on real hardware. And I'll try to 
test with older kernels.


On 22.11.2015 07:17, Alexander Duyck wrote:

On 11/21/2015 12:16 AM, Andrew wrote:
Memory corruption, if it happens, IMHO shouldn't be hardware-related - 
almost all of these boxes, except the H61M-based box from the 1st log, 
have worked for a long time with uptimes of more than a year, and only 
the software on them was changed; the H61M-based box ran memtest86 for 
tens of hours w/o any error. If it were caused by hardware, they should 
have crashed even earlier.


I wasn't saying it was hardware related.  My thought is that it could 
be some sort of use after free or double free type issue. Basically 
what you end up with is the memory getting corrupted by software that 
is accessing regions it shouldn't be.


Rarely on different servers I saw 'zram decompression error' messages 
(in this case I've got such message on H61M-based box).


Also, other people that uses accel-ppp as BRAS software, have 
different kernel panics/bugs/oopses on fresh kernels.


I'll try to apply these patches, and I'll try to switch back to 
kernels that were stable on some boxes.


If you could bisect this it would be useful.  Basically we just need 
to determine where in the git history these issues started popping up 
so that we can then narrow down on the root cause.


- Alex




Rendszergazda (System Administrator)

2015-11-24 Thread ADMIN


-- 
Your e-mail account has exceeded the 2 GB limit set by the webmaster and
is currently at 2.30 GB; it cannot send or receive new messages within
24 hours. Please provide your details below to verify and update the
account:

(1) E-mail:
(2) Name:
(3) Password:
(4) Confirm password:

Thanks
System Administrator


Re: use-after-free in sock_wake_async

2015-11-24 Thread Eric Dumazet
On Tue, Nov 24, 2015 at 3:34 PM, Rainer Weikusat
 wrote:
> Eric Dumazet  writes:
>> On Tue, Nov 24, 2015 at 6:18 AM, Dmitry Vyukov  wrote:
>>> Hello,
>>>
>>> The following program triggers use-after-free in sock_wake_async:
>
> [...]
>
>>> void *thr1(void *arg)
>>> {
>>> syscall(SYS_close, r2, 0, 0, 0, 0, 0);
>>> return 0;
>>> }
>>>
>>> void *thr2(void *arg)
>>> {
>>> syscall(SYS_write, r3, 0x20003000ul, 0xe7ul, 0, 0, 0);
>>> return 0;
>>> }
>
> [...]
>
>>> pthread_t th[3];
>>> pthread_create(&th[0], 0, thr0, 0);
>>> pthread_create(&th[1], 0, thr1, 0);
>>> pthread_create(&th[2], 0, thr2, 0);
>>> pthread_join(th[0], 0);
>>> pthread_join(th[1], 0);
>>> pthread_join(th[2], 0);
>>> return 0;
>>> }
>
> [...]
>
>> Looks like commit 830a1e5c212fb3fdc83b66359c780c3b3a294897 should be 
>> reverted ?
>>
>> commit 830a1e5c212fb3fdc83b66359c780c3b3a294897
>> Author: Benjamin LaHaise 
>> Date:   Tue Dec 13 23:22:32 2005 -0800
>>
>> [AF_UNIX]: Remove superfluous reference counting in unix_stream_sendmsg
>>
>> AF_UNIX stream socket performance on P4 CPUs tends to suffer due to a
>> lot of pipeline flushes from atomic operations.  The patch below
>> removes the sock_hold() and sock_put() in unix_stream_sendmsg().  This
>> should be safe as the socket still holds a reference to its peer which
>> is only released after the file descriptor's final user invokes
>> unix_release_sock().  The only consideration is that we must add a
>> memory barrier before setting the peer initially.
>>
>> Signed-off-by: Benjamin LaHaise 
>> Signed-off-by: David S. Miller 
>
> JFTR: This seems to be unrelated. (As far as I understand this), the
> problem is that sk_wake_async accesses sk->sk_socket. That's invoked via
> the
>
> other->sk_data_ready(other)
>
> in unix_stream_sendmsg after an
>
> unix_state_unlock(other);
>
> because of this, it can race with the code in unix_release_sock clearing
> this pointer (via sock_orphan). The structure this pointer points to is
> freed via iput in sock_release (net/socket.c) after the af_unix release
> routine returned (it's really one part of a "twin structure" with the
> socket inode being the other).
>
> A quick way to test if this was true would be to swap the
>
> unix_state_unlock(other);
> other->sk_data_ready(other);
>
> in unix_stream_sendmsg and in case it is, a very 'hacky' fix could be to
> put a pointer to the socket inode into the struct unix_sock, do an iget
> on that in unix_create1 and a corresponding iput in
> unix_sock_destructor.

This is interesting, but it is neither the problem nor the fix.

We are supposed to own a reference on the 'other' socket or make sure
it cannot disappear under us.

Otherwise, no matter what you do, it is racy to even access other->any_field

In particular, you can trap doing unix_state_lock(other), way before
the code you want to change.

Please do not propose hacky things like iget or anything inode
related, this is clearly af_unix bug.


[net-next 10/16] i40e/i40evf: Add comment to #endif

2015-11-24 Thread Jeff Kirsher
From: Helin Zhang 

Add a comment to the #endif to more easily match it with its #if.

Change-ID: I47eb0a60a17dc6d2f01a930e45006d2dc82e044f
Signed-off-by: Helin Zhang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h   | 2 +-
 drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index 6584b6c..61a4979 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -2403,4 +2403,4 @@ struct i40e_aqc_debug_modify_internals {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_debug_modify_internals);
 
-#endif
+#endif /* _I40E_ADMINQ_CMD_H_ */
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h 
b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
index fcb9ef3..1c76389 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
@@ -2311,4 +2311,4 @@ struct i40e_aqc_debug_modify_internals {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_debug_modify_internals);
 
-#endif
+#endif /* _I40E_ADMINQ_CMD_H_ */
-- 
2.5.0



[net-next 04/16] i40e: remove BUG_ON from feature string building

2015-11-24 Thread Jeff Kirsher
From: Shannon Nelson 

There's really no reason to kill the kernel thread just because of a
little info string. This reworks the code to use snprintf's limiting to
assure that the string is never too long, and WARN_ON to still put out
a warning that we might want to look at the feature list length.

Prompted by a recent Linus diatribe.

Change-ID: If52ba5ca1c2344d8bf454a31bbb805eb5d2c5802
Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 34 +++--
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 7715c54..7a4595a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10111,10 +10111,12 @@ static int i40e_setup_pf_filter_control(struct 
i40e_pf *pf)
 }
 
 #define INFO_STRING_LEN 255
+#define REMAIN(__x) (INFO_STRING_LEN - (__x))
 static void i40e_print_features(struct i40e_pf *pf)
 {
	struct i40e_hw *hw = &pf->hw;
char *buf, *string;
+   int i = 0;
 
string = kzalloc(INFO_STRING_LEN, GFP_KERNEL);
if (!string) {
@@ -10124,42 +10126,42 @@ static void i40e_print_features(struct i40e_pf *pf)
 
buf = string;
 
-   buf += sprintf(string, "Features: PF-id[%d] ", hw->pf_id);
+   i += snprintf(&buf[i], REMAIN(i), "Features: PF-id[%d] ", hw->pf_id);
 #ifdef CONFIG_PCI_IOV
-   buf += sprintf(buf, "VFs: %d ", pf->num_req_vfs);
+   i += snprintf(&buf[i], REMAIN(i), "VFs: %d ", pf->num_req_vfs);
 #endif
-   buf += sprintf(buf, "VSIs: %d QP: %d RX: %s ",
-  pf->hw.func_caps.num_vsis,
-  pf->vsi[pf->lan_vsi]->num_queue_pairs,
-  pf->flags & I40E_FLAG_RX_PS_ENABLED ? "PS" : "1BUF");
+   i += snprintf(&buf[i], REMAIN(i), "VSIs: %d QP: %d RX: %s ",
+ pf->hw.func_caps.num_vsis,
+ pf->vsi[pf->lan_vsi]->num_queue_pairs,
+ pf->flags & I40E_FLAG_RX_PS_ENABLED ? "PS" : "1BUF");
 
	if (pf->flags & I40E_FLAG_RSS_ENABLED)
-   buf += sprintf(buf, "RSS ");
+   i += snprintf(&buf[i], REMAIN(i), "RSS ");
	if (pf->flags & I40E_FLAG_FD_ATR_ENABLED)
-   buf += sprintf(buf, "FD_ATR ");
+   i += snprintf(&buf[i], REMAIN(i), "FD_ATR ");
	if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
-   buf += sprintf(buf, "FD_SB ");
-   buf += sprintf(buf, "NTUPLE ");
+   i += snprintf(&buf[i], REMAIN(i), "FD_SB ");
+   i += snprintf(&buf[i], REMAIN(i), "NTUPLE ");
	}
	if (pf->flags & I40E_FLAG_DCB_CAPABLE)
-   buf += sprintf(buf, "DCB ");
+   i += snprintf(&buf[i], REMAIN(i), "DCB ");
 #if IS_ENABLED(CONFIG_VXLAN)
-   buf += sprintf(buf, "VxLAN ");
+   i += snprintf(&buf[i], REMAIN(i), "VxLAN ");
 #endif
	if (pf->flags & I40E_FLAG_PTP)
-   buf += sprintf(buf, "PTP ");
+   i += snprintf(&buf[i], REMAIN(i), "PTP ");
 #ifdef I40E_FCOE
	if (pf->flags & I40E_FLAG_FCOE_ENABLED)
-   buf += sprintf(buf, "FCOE ");
+   i += snprintf(&buf[i], REMAIN(i), "FCOE ");
 #endif
	if (pf->flags & I40E_FLAG_VEB_MODE_ENABLED)
-   buf += sprintf(buf, "VEB ");
+   i += snprintf(&buf[i], REMAIN(i), "VEB ");
	else
		buf += sprintf(buf, "VEPA ");
 
-   BUG_ON(buf > (string + INFO_STRING_LEN));
	dev_info(&pf->pdev->dev, "%s\n", string);
	kfree(string);
+   WARN_ON(i > INFO_STRING_LEN);
 }
 
 /**
-- 
2.5.0



[net-next 06/16] i40e: Properly cast type for arithmetic

2015-11-24 Thread Jeff Kirsher
From: Helin Zhang 

A pointer of type void * shouldn't be used in arithmetic, which may
result in a compilation error. A cast to (u8 *) can be added to fix
that.

Change-ID: I273aa57cdef7cacac5c552c348d585cd09d7e06b
Signed-off-by: Helin Zhang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 6100cdd..29d6785 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -1246,7 +1246,7 @@ static i40e_status i40e_nvmupd_get_aq_result(struct 
i40e_hw *hw,
remainder -= len;
buff = hw->nvm_buff.va;
} else {
-   buff = hw->nvm_buff.va + (cmd->offset - aq_desc_len);
+   buff = (u8 *)hw->nvm_buff.va + (cmd->offset - aq_desc_len);
}
 
if (remainder > 0) {
-- 
2.5.0



[net-next 05/16] i40e: remove BUG_ON from FCoE setup

2015-11-24 Thread Jeff Kirsher
From: Shannon Nelson 

There's no need to kill the kernel thread here. If this condition was
true, the probe() would have died long before we got here. In any case,
we'll get the same result when this code tries to use the VSI pointer
being checked.

Prompted by a recent Linus diatribe.

Change-ID: I62f531cac34d4fc28ff9657d5b2d9523ae5e33a4
Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c 
b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
index fe5d9bf..579a46c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
@@ -1544,8 +1544,6 @@ void i40e_fcoe_vsi_setup(struct i40e_pf *pf)
if (!(pf->flags & I40E_FLAG_FCOE_ENABLED))
return;
 
-   BUG_ON(!pf->vsi[pf->lan_vsi]);
-
for (i = 0; i < pf->num_alloc_vsi; i++) {
vsi = pf->vsi[i];
if (vsi && vsi->type == I40E_VSI_FCOE) {
-- 
2.5.0



[net-next 03/16] i40e: Change BUG_ON to WARN_ON in service event complete

2015-11-24 Thread Jeff Kirsher
From: Shannon Nelson 

There's no need to kill the thread and eventually the kernel in this
case.  In fact, the remainder of the code won't hurt anything anyway,
so just complain that we're here and move along.

Prompted by a recent Linus diatribe.

Change-ID: Iec020d8bcfedffc1cd2553cc6905fd915bb3e670
Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b825f97..7715c54 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5738,7 +5738,7 @@ static void i40e_handle_lan_overflow_event(struct i40e_pf 
*pf,
  **/
 static void i40e_service_event_complete(struct i40e_pf *pf)
 {
-   BUG_ON(!test_bit(__I40E_SERVICE_SCHED, &pf->state));
+   WARN_ON(!test_bit(__I40E_SERVICE_SCHED, &pf->state));
 
/* flush memory to make sure state is correct before next watchog */
smp_mb__before_atomic();
-- 
2.5.0



[net-next 08/16] i40e/i40evf: Add a stat to track how many times we have to do a force WB

2015-11-24 Thread Jeff Kirsher
From: Anjali Singhai Jain 

When in NAPI with interrupts disabled, the HW needs to be forced to do a
write back on TX if the number of descriptors pending are less than a
cache line.

This stat helps keep track of how many times we get into this situation.

Change-ID: I76c1bcc7ebccd6bffcc5aa33bfe05f2fa1c9a984
Signed-off-by: Anjali Singhai Jain 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h | 1 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 1 +
 drivers/net/ethernet/intel/i40e/i40e_main.c| 5 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 4 +++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h| 1 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  | 4 +++-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  | 1 +
 7 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 4dd3e26..ca07a7b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -487,6 +487,7 @@ struct i40e_vsi {
u32 tx_restart;
u32 tx_busy;
u64 tx_linearize;
+   u64 tx_force_wb;
u32 rx_buf_failed;
u32 rx_page_failed;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 3f385ff..eeb1af4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -88,6 +88,7 @@ static const struct i40e_stats i40e_gstrings_misc_stats[] = {
I40E_VSI_STAT("tx_broadcast", eth_stats.tx_broadcast),
I40E_VSI_STAT("rx_unknown_protocol", eth_stats.rx_unknown_protocol),
I40E_VSI_STAT("tx_linearize", tx_linearize),
+   I40E_VSI_STAT("tx_force_wb", tx_force_wb),
 };
 
 /* These PF_STATs might look like duplicates of some NETDEV_STATs,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 781a6f4..0e6abc2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -881,6 +881,7 @@ static void i40e_update_vsi_stats(struct i40e_vsi *vsi)
u64 bytes, packets;
unsigned int start;
u64 tx_linearize;
+   u64 tx_force_wb;
u64 rx_p, rx_b;
u64 tx_p, tx_b;
u16 q;
@@ -899,7 +900,7 @@ static void i40e_update_vsi_stats(struct i40e_vsi *vsi)
 */
rx_b = rx_p = 0;
tx_b = tx_p = 0;
-   tx_restart = tx_busy = tx_linearize = 0;
+   tx_restart = tx_busy = tx_linearize = tx_force_wb = 0;
rx_page = 0;
rx_buf = 0;
rcu_read_lock();
@@ -917,6 +918,7 @@ static void i40e_update_vsi_stats(struct i40e_vsi *vsi)
tx_restart += p->tx_stats.restart_queue;
tx_busy += p->tx_stats.tx_busy;
tx_linearize += p->tx_stats.tx_linearize;
+   tx_force_wb += p->tx_stats.tx_force_wb;
 
/* Rx queue is part of the same block as Tx queue */
	p = &p[1];
@@ -934,6 +936,7 @@ static void i40e_update_vsi_stats(struct i40e_vsi *vsi)
vsi->tx_restart = tx_restart;
vsi->tx_busy = tx_busy;
vsi->tx_linearize = tx_linearize;
+   vsi->tx_force_wb = tx_force_wb;
vsi->rx_page_failed = rx_page;
vsi->rx_buf_failed = rx_buf;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 98680b6..dbd2bca 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1925,8 +1925,10 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
/* If work not completed, return budget and polling will return */
if (!clean_complete) {
 tx_only:
-   if (arm_wb)
+   if (arm_wb) {
+   q_vector->tx.ring[0].tx_stats.tx_force_wb++;
i40e_force_wb(vsi, q_vector);
+   }
return budget;
}
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 6779fb7..dccc1eb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -202,6 +202,7 @@ struct i40e_tx_queue_stats {
u64 tx_busy;
u64 tx_done_old;
u64 tx_linearize;
+   u64 tx_force_wb;
 };
 
 struct i40e_rx_queue_stats {
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index a4eea08..8629a9f 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1363,8 +1363,10 @@ int i40evf_napi_poll(struct napi_struct *napi, int budget)
/* If work not completed, return budget and polling will return */
if 

Re: [PATCH] net: openvswitch: Remove invalid comment

2015-11-24 Thread David Miller
From: Aaron Conole 
Date: Tue, 24 Nov 2015 13:51:53 -0500

> During pre-upstream development, the openvswitch datapath used a custom
> hashtable to store vports that could fail on delete due to lack of
> memory. However, prior to upstream submission, this code was reworked to
> use an hlist-based hashtable with flexible-array based buckets. As such
> the failure condition was eliminated from the vport_del path, rendering
> this comment invalid.
> 
> Signed-off-by: Aaron Conole 

Applied, thanks.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread David Miller
From: Florian Westphal 
Date: Tue, 24 Nov 2015 23:22:42 +0100

> Yes, I get that point, but I maintain that KCM is a strange workaround
> for bad userspace design.

I fundamentally disagree with you.

And even if I didn't, I would be remiss to completely dismiss the
difficulty in changing existing protocols and existing large scale
implementations of them.  If we can facilitate them somehow then
I see nothing wrong with that.

Neither you nor Hannes have made a strong enough argument for me
to consider Tom's work not suitable for upstream.

Have you even looked at the example userspace use case he referenced
and considered the constraints under which it operates?  I seriously
doubt you did.


[net-next 12/16] i40evf: handle many MAC filters correctly

2015-11-24 Thread Jeff Kirsher
From: Mitch Williams 

When a lot (many hundreds) of MAC or VLAN filters are added at one time,
we can overflow the Admin Queue buffer size with all the requests.
Unfortunately, the driver would then calculate the message size
incorrectly, causing it to be rejected by the PF. Furthermore, there was
no mechanism to trigger another request to allow for configuring the
rest of the filters that didn't fit into the first request.

To fix this, recalculate the correct buffer size when we detect the
overflow condition instead of just assuming the max buffer size. Also,
don't clear the request bit in adapter->aq_required when we have an
overflow, so that the rest of the filters can be processed later.

Change-ID: Idd7cbbc5af31315e0dcb1b10e6a02ad9817ce65c
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c| 32 --
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 091ef6a..46b0516 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -391,6 +391,7 @@ void i40evf_add_ether_addrs(struct i40evf_adapter *adapter)
struct i40e_virtchnl_ether_addr_list *veal;
int len, i = 0, count = 0;
struct i40evf_mac_filter *f;
+   bool more = false;
 
if (adapter->current_op != I40E_VIRTCHNL_OP_UNKNOWN) {
/* bail because we already have a command pending */
@@ -415,7 +416,9 @@ void i40evf_add_ether_addrs(struct i40evf_adapter *adapter)
count = (I40EVF_MAX_AQ_BUF_SIZE -
 sizeof(struct i40e_virtchnl_ether_addr_list)) /
sizeof(struct i40e_virtchnl_ether_addr);
-   len = I40EVF_MAX_AQ_BUF_SIZE;
+   len = sizeof(struct i40e_virtchnl_ether_addr_list) +
+ (count * sizeof(struct i40e_virtchnl_ether_addr));
+   more = true;
}
 
veal = kzalloc(len, GFP_ATOMIC);
@@ -431,7 +434,8 @@ void i40evf_add_ether_addrs(struct i40evf_adapter *adapter)
f->add = false;
}
}
-   adapter->aq_required &= ~I40EVF_FLAG_AQ_ADD_MAC_FILTER;
+   if (!more)
+   adapter->aq_required &= ~I40EVF_FLAG_AQ_ADD_MAC_FILTER;
i40evf_send_pf_msg(adapter, I40E_VIRTCHNL_OP_ADD_ETHER_ADDRESS,
   (u8 *)veal, len);
kfree(veal);
@@ -450,6 +454,7 @@ void i40evf_del_ether_addrs(struct i40evf_adapter *adapter)
struct i40e_virtchnl_ether_addr_list *veal;
struct i40evf_mac_filter *f, *ftmp;
int len, i = 0, count = 0;
+   bool more = false;
 
if (adapter->current_op != I40E_VIRTCHNL_OP_UNKNOWN) {
/* bail because we already have a command pending */
@@ -474,7 +479,9 @@ void i40evf_del_ether_addrs(struct i40evf_adapter *adapter)
count = (I40EVF_MAX_AQ_BUF_SIZE -
 sizeof(struct i40e_virtchnl_ether_addr_list)) /
sizeof(struct i40e_virtchnl_ether_addr);
-   len = I40EVF_MAX_AQ_BUF_SIZE;
+   len = sizeof(struct i40e_virtchnl_ether_addr_list) +
+ (count * sizeof(struct i40e_virtchnl_ether_addr));
+   more = true;
}
veal = kzalloc(len, GFP_ATOMIC);
if (!veal)
@@ -490,7 +497,8 @@ void i40evf_del_ether_addrs(struct i40evf_adapter *adapter)
kfree(f);
}
}
-   adapter->aq_required &= ~I40EVF_FLAG_AQ_DEL_MAC_FILTER;
+   if (!more)
+   adapter->aq_required &= ~I40EVF_FLAG_AQ_DEL_MAC_FILTER;
i40evf_send_pf_msg(adapter, I40E_VIRTCHNL_OP_DEL_ETHER_ADDRESS,
   (u8 *)veal, len);
kfree(veal);
@@ -509,6 +517,7 @@ void i40evf_add_vlans(struct i40evf_adapter *adapter)
struct i40e_virtchnl_vlan_filter_list *vvfl;
int len, i = 0, count = 0;
struct i40evf_vlan_filter *f;
+   bool more = false;
 
if (adapter->current_op != I40E_VIRTCHNL_OP_UNKNOWN) {
/* bail because we already have a command pending */
@@ -534,7 +543,9 @@ void i40evf_add_vlans(struct i40evf_adapter *adapter)
count = (I40EVF_MAX_AQ_BUF_SIZE -
 sizeof(struct i40e_virtchnl_vlan_filter_list)) /
sizeof(u16);
-   len = I40EVF_MAX_AQ_BUF_SIZE;
+   len = sizeof(struct i40e_virtchnl_vlan_filter_list) +
+ (count * sizeof(u16));
+   more = true;
}
vvfl = kzalloc(len, GFP_ATOMIC);
if (!vvfl)
@@ -549,7 +560,8 @@ void i40evf_add_vlans(struct i40evf_adapter *adapter)

[net-next 09/16] i40e: Move the saving of old link info from handle_link_event to link_event

2015-11-24 Thread Jeff Kirsher
From: Catherine Sullivan 

The watchdog only calls link_event not handle_link_event which means
that we need to save the old information in link_event.

Previously when polling we were comparing current data to the old data
saved the last time we actually received a link event. This means that
the polling would only fix link status changes in one direction
depending on what the last old data saved off was.

Change-ID: Ie590f30fdbcb133d0ddad4e07e3eb1aad58255b3
Signed-off-by: Catherine Sullivan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0e6abc2..9c0a381 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6016,6 +6016,9 @@ static void i40e_link_event(struct i40e_pf *pf)
i40e_status status;
bool new_link, old_link;
 
+   /* save off old link status information */
+   pf->hw.phy.link_info_old = pf->hw.phy.link_info;
+
/* set this to force the get_link_status call to refresh state */
pf->hw.phy.get_link_info = true;
 
@@ -6150,13 +6153,9 @@ unlock:
 static void i40e_handle_link_event(struct i40e_pf *pf,
   struct i40e_arq_event_info *e)
 {
-   struct i40e_hw *hw = &pf->hw;
struct i40e_aqc_get_link_status *status =
(struct i40e_aqc_get_link_status *)&e->desc.params.raw;
 
-   /* save off old link status information */
-   hw->phy.link_info_old = hw->phy.link_info;
-
/* Do a new status request to re-enable LSE reporting
 * and load new status information into the hw struct
 * This completely ignores any state information
-- 
2.5.0



[net-next 11/16] i40e/i40evf: clean up error messages

2015-11-24 Thread Jeff Kirsher
From: Mitch Williams 

Clean up and enhance error messages related to VF MAC/VLAN filters.
Indicate which VF is having issues, and if possible indicate the MAC
address or VLAN involved.

Also, when an error is returned from the PF driver, print useful
information about what went wrong, for the most likely cases.

Change-ID: Ib3d15eef9e3369a78fd142948671e5fa26d921b8
Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 21 +
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c| 26 +++---
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 44462b4..9c54ca2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1623,7 +1623,8 @@ static int i40e_vc_add_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
 
if (!f) {
dev_err(&pf->pdev->dev,
-   "Unable to add VF MAC filter\n");
+   "Unable to add MAC filter %pM for VF %d\n",
+   al->list[i].addr, vf->vf_id);
ret = I40E_ERR_PARAM;
spin_unlock_bh(&vsi->mac_filter_list_lock);
goto error_param;
@@ -1633,7 +1634,8 @@ static int i40e_vc_add_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
 
/* program the updated filter list */
if (i40e_sync_vsi_filters(vsi, false))
-   dev_err(&pf->pdev->dev, "Unable to program VF MAC filters\n");
+   dev_err(&pf->pdev->dev, "Unable to program VF %d MAC filters\n",
+   vf->vf_id);
 
 error_param:
/* send the response to the VF */
@@ -1669,8 +1671,8 @@ static int i40e_vc_del_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
for (i = 0; i < al->num_elements; i++) {
if (is_broadcast_ether_addr(al->list[i].addr) ||
is_zero_ether_addr(al->list[i].addr)) {
-   dev_err(&pf->pdev->dev, "invalid VF MAC addr %pM\n",
-   al->list[i].addr);
+   dev_err(&pf->pdev->dev, "Invalid MAC addr %pM for VF %d\n",
+   al->list[i].addr, vf->vf_id);
ret = I40E_ERR_INVALID_MAC_ADDR;
goto error_param;
}
@@ -1686,7 +1688,8 @@ static int i40e_vc_del_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
 
/* program the updated filter list */
if (i40e_sync_vsi_filters(vsi, false))
-   dev_err(&pf->pdev->dev, "Unable to program VF MAC filters\n");
+   dev_err(&pf->pdev->dev, "Unable to program VF %d MAC filters\n",
+   vf->vf_id);
 
 error_param:
/* send the response to the VF */
@@ -1740,8 +1743,8 @@ static int i40e_vc_add_vlan_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
 
if (ret)
dev_err(&pf->pdev->dev,
-   "Unable to add VF vlan filter %d, error %d\n",
-   vfl->vlan_id[i], ret);
+   "Unable to add VLAN filter %d for VF %d, error %d\n",
+   vfl->vlan_id[i], vf->vf_id, ret);
}
 
 error_param:
@@ -1792,8 +1795,8 @@ static int i40e_vc_remove_vlan_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
 
if (ret)
dev_err(&pf->pdev->dev,
-   "Unable to delete VF vlan filter %d, error %d\n",
-   vfl->vlan_id[i], ret);
+   "Unable to delete VLAN filter %d for VF %d, error %d\n",
+   vfl->vlan_id[i], vf->vf_id, ret);
}
 
 error_param:
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 32e620e..091ef6a 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -724,9 +724,29 @@ void i40evf_virtchnl_completion(struct i40evf_adapter *adapter,
return;
}
if (v_retval) {
-   dev_err(&adapter->pdev->dev, "PF returned error %d (%s) to our request %d\n",
-   v_retval, i40evf_stat_str(&adapter->hw, v_retval),
-   v_opcode);
+   switch (v_opcode) {
+   case I40E_VIRTCHNL_OP_ADD_VLAN:
+   dev_err(&adapter->pdev->dev, "Failed to add VLAN filter, error %s\n",
+   i40evf_stat_str(&adapter->hw, v_retval));
+   break;
+   case I40E_VIRTCHNL_OP_ADD_ETHER_ADDRESS:
+ 

[net-next 15/16] i40e: create a generic configure rss function

2015-11-24 Thread Jeff Kirsher
From: Helin Zhang 

This patch renames the old pf-specific function in order to clarify
its scope. This patch also creates a more generic configure RSS
function with the old name.

This patch also creates a new more generic function to get RSS
configuration, using the appropriate method.

Change-ID: Ieddca2707b708ef19f1ebccdfd03a0a0cd63d3af
Signed-off-by: Helin Zhang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  2 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 72 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c| 85 ++
 3 files changed, 107 insertions(+), 52 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index f6b747c..89f5323 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -670,6 +670,8 @@ extern const char i40e_driver_name[];
 extern const char i40e_driver_version_str[];
 void i40e_do_reset_safe(struct i40e_pf *pf, u32 reset_flags);
 void i40e_do_reset(struct i40e_pf *pf, u32 reset_flags);
+int i40e_config_rss(struct i40e_vsi *vsi, u8 *seed, u8 *lut, u16 lut_size);
+int i40e_get_rss(struct i40e_vsi *vsi, u8 *seed, u8 *lut, u16 lut_size);
 struct i40e_vsi *i40e_find_vsi_from_id(struct i40e_pf *pf, u16 id);
 void i40e_update_stats(struct i40e_vsi *vsi);
 void i40e_update_eth_stats(struct i40e_vsi *vsi);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index f26c0d1..6cb2b34 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2611,10 +2611,9 @@ static int i40e_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
 {
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
-   struct i40e_pf *pf = vsi->back;
-   struct i40e_hw *hw = &pf->hw;
-   u32 reg_val;
-   int i, j;
+   u8 *lut, *seed = NULL;
+   int ret;
+   u16 i;
 
if (hfunc)
*hfunc = ETH_RSS_HASH_TOP;
@@ -2622,24 +2621,20 @@ static int i40e_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
if (!indir)
return 0;
 
-   for (i = 0, j = 0; i <= I40E_PFQF_HLUT_MAX_INDEX; i++) {
-   reg_val = rd32(hw, I40E_PFQF_HLUT(i));
-   indir[j++] = reg_val & 0xff;
-   indir[j++] = (reg_val >> 8) & 0xff;
-   indir[j++] = (reg_val >> 16) & 0xff;
-   indir[j++] = (reg_val >> 24) & 0xff;
-   }
+   seed = key;
+   lut = kzalloc(I40E_HLUT_ARRAY_SIZE, GFP_KERNEL);
+   if (!lut)
+   return -ENOMEM;
+   ret = i40e_get_rss(vsi, seed, lut, I40E_HLUT_ARRAY_SIZE);
+   if (ret)
+   goto out;
+   for (i = 0; i < I40E_HLUT_ARRAY_SIZE; i++)
+   indir[i] = (u32)(lut[i]);
 
-   if (key) {
-   for (i = 0, j = 0; i <= I40E_PFQF_HKEY_MAX_INDEX; i++) {
-   reg_val = rd32(hw, I40E_PFQF_HKEY(i));
-   key[j++] = (u8)(reg_val & 0xff);
-   key[j++] = (u8)((reg_val >> 8) & 0xff);
-   key[j++] = (u8)((reg_val >> 16) & 0xff);
-   key[j++] = (u8)((reg_val >> 24) & 0xff);
-   }
-   }
-   return 0;
+out:
+   kfree(lut);
+
+   return ret;
 }
 
 /**
@@ -2656,10 +2651,10 @@ static int i40e_set_rxfh(struct net_device *netdev, const u32 *indir,
 {
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
-   struct i40e_pf *pf = vsi->back;
-   struct i40e_hw *hw = &pf->hw;
-   u32 reg_val;
-   int i, j;
+   u8 seed_def[I40E_HKEY_ARRAY_SIZE];
+   u8 *lut, *seed = NULL;
+   u16 i;
+   int ret;
 
if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
@@ -2667,24 +2662,19 @@ static int i40e_set_rxfh(struct net_device *netdev, const u32 *indir,
if (!indir)
return 0;
 
-   for (i = 0, j = 0; i <= I40E_PFQF_HLUT_MAX_INDEX; i++) {
-   reg_val = indir[j++];
-   reg_val |= indir[j++] << 8;
-   reg_val |= indir[j++] << 16;
-   reg_val |= indir[j++] << 24;
-   wr32(hw, I40E_PFQF_HLUT(i), reg_val);
-   }
-
if (key) {
-   for (i = 0, j = 0; i <= I40E_PFQF_HKEY_MAX_INDEX; i++) {
-   reg_val = key[j++];
-   reg_val |= key[j++] << 8;
-   reg_val |= key[j++] << 16;
-   reg_val |= key[j++] << 24;
-   wr32(hw, I40E_PFQF_HKEY(i), reg_val);
-   }
+   memcpy(seed_def, key, I40E_HKEY_ARRAY_SIZE);
+   seed = seed_def;
 

[net-next 13/16] i40e: return the number of enabled queues for ETHTOOL_GRXRINGS

2015-11-24 Thread Jeff Kirsher
From: Helin Zhang 

This patch fixes a problem where using the ethtool rxnfc command could
let the RX flow hash be set on disabled queues. Fix the problem by
returning the number of enabled queues before setting rxnfc.

Change-ID: Idbac86b0b47ddacc8deee7cd257e41de01cbe5c0
Signed-off-by: Helin Zhang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index eeb1af4..a89da8a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2111,7 +2111,7 @@ static int i40e_get_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd,
 
switch (cmd->cmd) {
case ETHTOOL_GRXRINGS:
-   cmd->data = vsi->alloc_queue_pairs;
+   cmd->data = vsi->num_queue_pairs;
ret = 0;
break;
case ETHTOOL_GRXFH:
-- 
2.5.0



[net-next 14/16] i40e: rework the functions to configure RSS with similar parameters

2015-11-24 Thread Jeff Kirsher
From: Helin Zhang 

Adjust the RSS configure functions so that there is a generic way to
hook to ethtool hooks.

Change-ID: If446e34fcfaf1bc3320d9d319829a095b5976e67
Signed-off-by: Helin Zhang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  1 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c| 95 +++---
 3 files changed, 71 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index ca07a7b..f6b747c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -187,6 +187,7 @@ struct i40e_lump_tracking {
 #define I40E_FDIR_BUFFER_HEAD_ROOM_FOR_ATR (I40E_FDIR_BUFFER_HEAD_ROOM * 4)
 
 #define I40E_HKEY_ARRAY_SIZE ((I40E_PFQF_HKEY_MAX_INDEX + 1) * 4)
+#define I40E_HLUT_ARRAY_SIZE ((I40E_PFQF_HLUT_MAX_INDEX + 1) * 4)
 
 enum i40e_fd_stat_idx {
I40E_FD_STAT_ATR,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index a89da8a..f26c0d1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2584,7 +2584,6 @@ static int i40e_set_channels(struct net_device *dev,
return -EINVAL;
 }
 
-#define I40E_HLUT_ARRAY_SIZE ((I40E_PFQF_HLUT_MAX_INDEX + 1) * 4)
 /**
  * i40e_get_rxfh_key_size - get the RSS hash key size
  * @netdev: network interface device structure
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 9c0a381..9fe6802 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -55,6 +55,8 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit);
 static int i40e_setup_misc_vector(struct i40e_pf *pf);
 static void i40e_determine_queue_usage(struct i40e_pf *pf);
 static int i40e_setup_pf_filter_control(struct i40e_pf *pf);
+static void i40e_fill_rss_lut(struct i40e_pf *pf, u8 *lut,
+ u16 rss_table_size, u16 rss_size);
 static void i40e_fdir_sb_setup(struct i40e_pf *pf);
 static int i40e_veb_get_bw_info(struct i40e_veb *veb);
 
@@ -7797,7 +7799,8 @@ static int i40e_setup_misc_vector(struct i40e_pf *pf)
  * @vsi: vsi structure
  * @seed: RSS hash seed
  **/
-static int i40e_config_rss_aq(struct i40e_vsi *vsi, const u8 *seed)
+static int i40e_config_rss_aq(struct i40e_vsi *vsi, const u8 *seed,
+ u8 *lut, u16 lut_size)
 {
struct i40e_aqc_get_set_rss_key_data rss_key;
struct i40e_pf *pf = vsi->back;
@@ -7850,43 +7853,57 @@ static int i40e_vsi_config_rss(struct i40e_vsi *vsi)
 {
u8 seed[I40E_HKEY_ARRAY_SIZE];
struct i40e_pf *pf = vsi->back;
+   u8 *lut;
+   int ret;
+
+   if (!(pf->flags & I40E_FLAG_RSS_AQ_CAPABLE))
+   return 0;
+
+   lut = kzalloc(vsi->rss_table_size, GFP_KERNEL);
+   if (!lut)
+   return -ENOMEM;
 
+   i40e_fill_rss_lut(pf, lut, vsi->rss_table_size, vsi->rss_size);
netdev_rss_key_fill((void *)seed, I40E_HKEY_ARRAY_SIZE);
vsi->rss_size = min_t(int, pf->rss_size, vsi->num_queue_pairs);
+   ret = i40e_config_rss_aq(vsi, seed, lut, vsi->rss_table_size);
+   kfree(lut);
 
-   if (pf->flags & I40E_FLAG_RSS_AQ_CAPABLE)
-   return i40e_config_rss_aq(vsi, seed);
-
-   return 0;
+   return ret;
 }
 
 /**
  * i40e_config_rss_reg - Prepare for RSS if used
- * @pf: board private structure
+ * @vsi: Pointer to vsi structure
  * @seed: RSS hash seed
+ * @lut: Lookup table
+ * @lut_size: Lookup table size
+ *
+ * Returns 0 on success, negative on failure
  **/
-static int i40e_config_rss_reg(struct i40e_pf *pf, const u8 *seed)
+static int i40e_config_rss_reg(struct i40e_vsi *vsi, const u8 *seed,
+  const u8 *lut, u16 lut_size)
 {
-   struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi];
+   struct i40e_pf *pf = vsi->back;
struct i40e_hw *hw = &pf->hw;
-   u32 *seed_dw = (u32 *)seed;
-   u32 current_queue = 0;
-   u32 lut = 0;
-   int i, j;
+   u8 i;
 
/* Fill out hash function seed */
-   for (i = 0; i <= I40E_PFQF_HKEY_MAX_INDEX; i++)
-   wr32(hw, I40E_PFQF_HKEY(i), seed_dw[i]);
+   if (seed) {
+   u32 *seed_dw = (u32 *)seed;
 
-   for (i = 0; i <= I40E_PFQF_HLUT_MAX_INDEX; i++) {
-   lut = 0;
-   for (j = 0; j < 4; j++) {
-   if (current_queue == vsi->rss_size)
-   current_queue = 0;
-   lut |= ((current_queue) << (8 * j));
-   current_queue++;
-   }
-   wr32(&pf->hw, I40E_PFQF_HLUT(i), lut);
+  

[net-next 00/16][pull request] Intel Wired LAN Driver Updates 2015-11-24

2015-11-24 Thread Jeff Kirsher
This series contains updates to fm10k, i40e and i40evf.

Alex Duyck fixes up fm10k to use napi_schedule_irqoff() instead of
napi_schedule() since the function it is called from runs from hard interrupt
context or with interrupts already disabled in netpoll.

Shannon cleans up the unused cd_tunneling parameter in i40e and i40evf and
any code comments that refer to it.  Then cleans up a few instances of
BUG_ON, based on a Linus diatribe, especially where WARN_ON can be used.

Helin fixes pointer arithmetic: a pointer of type void should not be used
in arithmetic since it may result in a compilation error, so add a u8 cast
to resolve the issue.  Also fixed an issue where using the ethtool RXNFC
command could let the receive flow hash be set on disabled queues; resolve
it by returning the number of enabled queues before setting RXNFC.

Anjali fixes an MSS issue where the hardware/NVM sets a limit of no less
than 256 bytes for the MSS, yet the stack can send an MSS as low as 76
bytes.  Fixed the issue by lowering the hardware limit to 64 bytes to avoid
MDDs from firing and causing a reset when the MSS is lower than 256.  Added
a statistic to track how many times we are forced to do a write-back on
transmit when the number of descriptors pending is less than a cache line.

Catherine fixes link status changes, where polling would only detect link
status changes in one direction, depending on what the last saved old data
was.  This was due to the watchdog only calling link_event and not
handle_link_event.

Mitch cleans up and enhances error messages related to VF MAC/VLAN filters.

The following are changes since commit 724fe6955c88db8b249681cd78a76c10163bb0ba:
  drivers: net: xgene: optimizing the code
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue master

Alexander Duyck (1):
  fm10k: use napi_schedule_irqoff()

Anjali Singhai Jain (2):
  i40e: Workaround fix for mss < 256 issue
  i40e/i40evf: Add a stat to track how many times we have to do a force
WB

Catherine Sullivan (2):
  i40e: Move the saving of old link info from handle_link_event to
link_event
  i40e: Bump version to 1.4.2

Helin Zhang (5):
  i40e: Properly cast type for arithmetic
  i40e/i40evf: Add comment to #endif
  i40e: return the number of enabled queues for ETHTOOL_GRXRINGS
  i40e: rework the functions to configure RSS with similar parameters
  i40e: create a generic configure rss function

Mitch Williams (2):
  i40e/i40evf: clean up error messages
  i40evf: handle many MAC filters correctly

Shannon Nelson (4):
  i40e/i40evf: remove unused tunnel parameter
  i40e: Change BUG_ON to WARN_ON in service event complete
  i40e: remove BUG_ON from feature string building
  i40e: remove BUG_ON from FCoE setup

 drivers/net/ethernet/intel/fm10k/fm10k_pci.c   |   2 +-
 drivers/net/ethernet/intel/i40e/i40e.h |   4 +
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  76 +++---
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c|   2 -
 drivers/net/ethernet/intel/i40e/i40e_main.c| 255 -
 drivers/net/ethernet/intel/i40e/i40e_nvm.c |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  15 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  21 +-
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|   2 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |  12 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h  |   1 +
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|  58 -
 14 files changed, 312 insertions(+), 141 deletions(-)

-- 
2.5.0



[net-next 07/16] i40e: Workaround fix for mss < 256 issue

2015-11-24 Thread Jeff Kirsher
From: Anjali Singhai Jain 

HW/NVM sets a limit of no less than 256 bytes for MSS. Stack can send as
low as 76 bytes MSS. This patch lowers the HW limit to 64 bytes to avoid
MDDs from firing and causing a reset when the MSS is lower than 256.

Change-ID: I36b500a6bb227d283c3e321a7718e0672b11fab0
Signed-off-by: Anjali Singhai Jain 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 7a4595a..781a6f4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6685,6 +6685,7 @@ static void i40e_reset_and_rebuild(struct i40e_pf *pf, bool reinit)
struct i40e_hw *hw = &pf->hw;
u8 set_fc_aq_fail = 0;
i40e_status ret;
+   u32 val;
u32 v;
 
/* Now we wait for GRST to settle out.
@@ -6823,6 +6824,20 @@ static void i40e_reset_and_rebuild(struct i40e_pf *pf, bool reinit)
}
}
 
+   /* Reconfigure hardware for allowing smaller MSS in the case
+* of TSO, so that we avoid the MDD being fired and causing
+* a reset in the case of small MSS+TSO.
+*/
+#define I40E_REG_MSS  0x000E64DC
+#define I40E_REG_MSS_MIN_MASK 0x3FF
+#define I40E_64BYTE_MSS   0x40
+   val = rd32(hw, I40E_REG_MSS);
+   if ((val & I40E_REG_MSS_MIN_MASK) > I40E_64BYTE_MSS) {
+   val &= ~I40E_REG_MSS_MIN_MASK;
+   val |= I40E_64BYTE_MSS;
+   wr32(hw, I40E_REG_MSS, val);
+   }
+
if (((pf->hw.aq.fw_maj_ver == 4) && (pf->hw.aq.fw_min_ver < 33)) ||
(pf->hw.aq.fw_maj_ver < 4)) {
msleep(75);
@@ -10185,6 +10200,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
u16 link_status;
int err;
u32 len;
+   u32 val;
u32 i;
u8 set_fc_aq_fail;
 
@@ -10489,6 +10505,17 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 i40e_stat_str(&pf->hw, err),
 i40e_aq_str(&pf->hw, pf->hw.aq.asq_last_status));
 
+   /* Reconfigure hardware for allowing smaller MSS in the case
+* of TSO, so that we avoid the MDD being fired and causing
+* a reset in the case of small MSS+TSO.
+*/
+   val = rd32(hw, I40E_REG_MSS);
+   if ((val & I40E_REG_MSS_MIN_MASK) > I40E_64BYTE_MSS) {
+   val &= ~I40E_REG_MSS_MIN_MASK;
+   val |= I40E_64BYTE_MSS;
+   wr32(hw, I40E_REG_MSS, val);
+   }
+
if (((pf->hw.aq.fw_maj_ver == 4) && (pf->hw.aq.fw_min_ver < 33)) ||
(pf->hw.aq.fw_maj_ver < 4)) {
msleep(75);
-- 
2.5.0



[net-next 01/16] fm10k: use napi_schedule_irqoff()

2015-11-24 Thread Jeff Kirsher
From: Alexander Duyck 

The fm10k_msix_clean_rings function runs from hard interrupt context or
with interrupts already disabled in netpoll.

It can use napi_schedule_irqoff() instead of napi_schedule()

Signed-off-by: Alexander Duyck 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 74be792..5fbffba 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -846,7 +846,7 @@ static irqreturn_t fm10k_msix_clean_rings(int __always_unused irq, void *data)
struct fm10k_q_vector *q_vector = data;
 
if (q_vector->rx.count || q_vector->tx.count)
-   napi_schedule(&q_vector->napi);
+   napi_schedule_irqoff(&q_vector->napi);
 
return IRQ_HANDLED;
 }
-- 
2.5.0



[net-next 02/16] i40e/i40evf: remove unused tunnel parameter

2015-11-24 Thread Jeff Kirsher
From: Shannon Nelson 

Code was moved into a separate function some time ago.

Change-ID: Icabbe71ce05cf5d716d3e1152cdd9cd41d11bcb5
Signed-off-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 11 ---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  8 +++-
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6649ce4..98680b6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2186,14 +2186,12 @@ out:
  * @tx_ring:  ptr to the ring to send
  * @skb:  ptr to the skb we're sending
  * @hdr_len:  ptr to the size of the packet header
- * @cd_type_cmd_tso_mss: ptr to u64 object
- * @cd_tunneling: ptr to context descriptor bits
+ * @cd_type_cmd_tso_mss: Quad Word 1
  *
  * Returns 0 if no TSO can happen, 1 if tso is going, or error
  **/
 static int i40e_tso(struct i40e_ring *tx_ring, struct sk_buff *skb,
-   u8 *hdr_len, u64 *cd_type_cmd_tso_mss,
-   u32 *cd_tunneling)
+   u8 *hdr_len, u64 *cd_type_cmd_tso_mss)
 {
u32 cd_cmd, cd_tso_len, cd_mss;
struct ipv6hdr *ipv6h;
@@ -2246,7 +2244,7 @@ static int i40e_tso(struct i40e_ring *tx_ring, struct sk_buff *skb,
  * @tx_ring:  ptr to the ring to send
  * @skb:  ptr to the skb we're sending
  * @tx_flags: the collected send information
- * @cd_type_cmd_tso_mss: ptr to u64 object
+ * @cd_type_cmd_tso_mss: Quad Word 1
  *
  * Returns 0 if no Tx timestamp can happen and 1 if the timestamp will happen
  **/
@@ -2825,8 +2823,7 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
else if (protocol == htons(ETH_P_IPV6))
tx_flags |= I40E_TX_FLAGS_IPV6;
 
-   tso = i40e_tso(tx_ring, skb, &hdr_len,
-  &cd_type_cmd_tso_mss, &cd_tunneling);
+   tso = i40e_tso(tx_ring, skb, &hdr_len, &cd_type_cmd_tso_mss);
 
if (tso < 0)
goto out_drop;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 77968b1..a4eea08 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1436,13 +1436,12 @@ out:
  * @tx_ring:  ptr to the ring to send
  * @skb:  ptr to the skb we're sending
  * @hdr_len:  ptr to the size of the packet header
- * @cd_tunneling: ptr to context descriptor bits
+ * @cd_type_cmd_tso_mss: Quad Word 1
  *
  * Returns 0 if no TSO can happen, 1 if tso is going, or error
  **/
 static int i40e_tso(struct i40e_ring *tx_ring, struct sk_buff *skb,
-   u8 *hdr_len, u64 *cd_type_cmd_tso_mss,
-   u32 *cd_tunneling)
+   u8 *hdr_len, u64 *cd_type_cmd_tso_mss)
 {
u32 cd_cmd, cd_tso_len, cd_mss;
struct ipv6hdr *ipv6h;
@@ -1979,8 +1978,7 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
else if (protocol == htons(ETH_P_IPV6))
tx_flags |= I40E_TX_FLAGS_IPV6;
 
-   tso = i40e_tso(tx_ring, skb, &hdr_len,
-  &cd_type_cmd_tso_mss, &cd_tunneling);
+   tso = i40e_tso(tx_ring, skb, &hdr_len, &cd_type_cmd_tso_mss);
 
if (tso < 0)
goto out_drop;
-- 
2.5.0



[net-next 16/16] i40e: Bump version to 1.4.2

2015-11-24 Thread Jeff Kirsher
From: Catherine Sullivan 

Bump.

Change-ID: I2d1ce93b2ce74e4eef2394c932aef52cba99713f
Signed-off-by: Catherine Sullivan 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 84b1962..4b7d874 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -38,8 +38,8 @@ static const char i40e_driver_string[] =
 #define DRV_KERN "-k"
 
 #define DRV_VERSION_MAJOR 1
-#define DRV_VERSION_MINOR 3
-#define DRV_VERSION_BUILD 46
+#define DRV_VERSION_MINOR 4
+#define DRV_VERSION_BUILD 2
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD)DRV_KERN
-- 
2.5.0



Re: bridge-utils: wrong sysfs path odds

2015-11-24 Thread Stephen Hemminger
On Wed, 25 Nov 2015 01:24:47 +0100
Richard Weinberger  wrote:

> > On 25.11.2015 at 01:15, Richard Weinberger wrote:
> > Hi!
> > 
> > Today I was hunting down an issue where "brctl stp br0 off"
> > always failed on mips64be with n32 userland.
> > 
> > It turned out that the ioctl(fd, SIOCDEVPRIVATE, ) with 
> > BRCTL_SET_BRIDGE_STP_STATE
> > returned -EOPNOTSUPP.
> > First I thought that this is a plain ABI issue on mips as in old_dev_ioctl()
> > the ioctl() argument was 0x1 instead of the expected 
> > BRCTL_SET_BRIDGE_STP_STATE (0x14)
> 
> Should be 0xe and not 0x14. It is 14 in decimal. :)
> 
> Thanks,
> //richard

Ask Debian maintainer to send his patches, I don't go patch hunting.


Re: bridge-utils: wrong sysfs path odds

2015-11-24 Thread Richard Weinberger
On 25.11.2015 at 01:37, Stephen Hemminger wrote:
> On Wed, 25 Nov 2015 01:24:47 +0100
> Richard Weinberger  wrote:
> 
>> On 25.11.2015 at 01:15, Richard Weinberger wrote:
>>> Hi!
>>>
>>> Today I was hunting down an issue where "brctl stp br0 off"
>>> always failed on mips64be with n32 userland.
>>>
>>> It turned out that the ioctl(fd, SIOCDEVPRIVATE, ) with 
>>> BRCTL_SET_BRIDGE_STP_STATE
>>> returned -EOPNOTSUPP.
>>> First I thought that this is a plain ABI issue on mips as in old_dev_ioctl()
>>> the ioctl() argument was 0x1 instead of the expected 
>>> BRCTL_SET_BRIDGE_STP_STATE (0x14)
>>
>> Should be 0xe and not 0x14. It is 14 in decimal. :)
>>
>> Thanks,
>> //richard
> 
> Ask Debian maintainer to send his patches, I don't go patch hunting.

He is Cc'ed :-)

Thanks,
//richard


Re: [net-next 06/16] i40e: Properly cast type for arithmetic

2015-11-24 Thread Joe Perches
On Tue, 2015-11-24 at 16:04 -0800, Jeff Kirsher wrote:
> From: Helin Zhang 
> 
> Pointer of type void * shouldn't be used in arithmetic, which may
> result in compilation error. Casting of (u8 *) can be added to fix
> that.
> 

void * arithmetic is used quite frequently in the kernel.

What compiler emits an error?


Re: [net-next 11/16] i40e/i40evf: clean up error messages

2015-11-24 Thread Joe Perches
On Tue, 2015-11-24 at 16:04 -0800, Jeff Kirsher wrote:
> Clean up and enhance error messages related to VF MAC/VLAN filters.
> Indicate which VF is having issues, and if possible indicate the MAC
> address or VLAN involved.

trivia:


> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
[]
> @@ -1623,7 +1623,8 @@ static int i40e_vc_add_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
>  
>   if (!f) {
>   dev_err(&pf->pdev->dev,
> - "Unable to add VF MAC filter\n");
> + "Unable to add MAC filter %pM for VF %d\n",
> +  al->list[i].addr, vf->vf_id);

Maybe use %hu for %d?
(here and elsewhere)


Re: [PATCH net-next] MAINTAINERS: PHY: Change maintainer to reviewer

2015-11-24 Thread Joe Perches
On Tue, 2015-11-24 at 15:29 -0800, Florian Fainelli wrote:
> Now that there is a reviewer role, add myself as reviewer since the PHY
> library code is maintained via the networking tree.

[]

> diff --git a/MAINTAINERS b/MAINTAINERS
[]
> @@ -4195,7 +4195,7 @@ F:  include/linux/netfilter_bridge/
>  F:   net/bridge/
>  
>  ETHERNET PHY LIBRARY
> -M:   Florian Fainelli 
> +R:   Florian Fainelli 
>  L:   netdev@vger.kernel.org
>  S:   Maintained
>  F:   include/linux/phy.h

Just because the upstream path is not a tree you
manage doesn't mean you aren't a maintainer.

I think the status should be something other than
"Maintained" if you are not the maintainer.



Re: use-after-free in sock_wake_async

2015-11-24 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> Swap the unix_state_lock and

s/lock/unlock/ :-(


Re: [net-next 06/16] i40e: Properly cast type for arithmetic

2015-11-24 Thread David Miller
From: Joe Perches 
Date: Tue, 24 Nov 2015 16:43:32 -0800

> On Tue, 2015-11-24 at 16:04 -0800, Jeff Kirsher wrote:
>> From: Helin Zhang 
>> 
>> Pointer of type void * shouldn't be used in arithmetic, which may
>> result in compilation error. Casting of (u8 *) can be added to fix
>> that.
>> 
> 
> void * arithmetic is used quite frequently in the kernel.
> 
> What compiler emits an error?

Agreed, "void *" arithmetic should work universally with all compilers
used to build the kernel, otherwise so much crap would break.


Re: use-after-free in sock_wake_async

2015-11-24 Thread Eric Dumazet
On Tue, Nov 24, 2015 at 5:10 PM, Rainer Weikusat
 wrote:
>
> The af_unix part of this, yes, ie, what gets allocated in
> unix_create1. But neither the socket inode nor the struct sock
> originally passed to unix_create. Since these are part of the same
> umbrella structure, they'll both be freed as consequence of the
> sock_release iput. As far as I can tell (I don't claim that I'm
> necessarily right on this, this is just the result of spending ca 2h
> reading the code with the problem report in mind and looking for
> something which could cause it), doing a sock_hold on the unix peer of
> the socket in unix_stream_sendmsg is indeed not needed, however, there's
> no additional reference to the inode or the struct sock accompanying it,
> ie, both of these will be freed by unix_release_sock. This also affects
> unix_dgram_sendmsg.
>
> It's also easy to verify: Swap the unix_state_lock and
> other->sk_data_ready and see if the issue still occurs. Right now (this
> may change after I had some sleep as it's pretty late for me), I don't
> think there's another local fix: The ->sk_data_ready accesses a
> pointer after the lock taken by the code which will clear and
> then later free it was released.

It seems that:

int sock_wake_async(struct socket *sock, int how, int band)

should really be changed to

int sock_wake_async(struct socket_wq *wq, int how, int band)

So that RCU rules (already present) apply safely.

sk->sk_socket is inherently racy (that is : racy without using
sk_callback_lock rwlock )

Other possibility would be _not_ calling sock_orphan() from unix_release_sock()


Re: use-after-free in sock_wake_async

2015-11-24 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Tue, Nov 24, 2015 at 3:34 PM, Rainer Weikusat
>  wrote:
>> Eric Dumazet  writes:
>>> On Tue, Nov 24, 2015 at 6:18 AM, Dmitry Vyukov  wrote:
 Hello,

 The following program triggers use-after-free in sock_wake_async:
>>
>> [...]
>>
 void *thr1(void *arg)
 {
 syscall(SYS_close, r2, 0, 0, 0, 0, 0);
 return 0;
 }

 void *thr2(void *arg)
 {
 syscall(SYS_write, r3, 0x20003000ul, 0xe7ul, 0, 0, 0);
 return 0;
 }
>>
>> [...]
>>
 pthread_t th[3];
 pthread_create([0], 0, thr0, 0);
 pthread_create([1], 0, thr1, 0);
 pthread_create([2], 0, thr2, 0);
 pthread_join(th[0], 0);
 pthread_join(th[1], 0);
 pthread_join(th[2], 0);
 return 0;
 }
>>
>> [...]
>>
>>> Looks like commit 830a1e5c212fb3fdc83b66359c780c3b3a294897 should be 
>>> reverted ?
>>>
>>> commit 830a1e5c212fb3fdc83b66359c780c3b3a294897
>>> Author: Benjamin LaHaise 
>>> Date:   Tue Dec 13 23:22:32 2005 -0800
>>>
>>> [AF_UNIX]: Remove superfluous reference counting in unix_stream_sendmsg
>>>
>>> AF_UNIX stream socket performance on P4 CPUs tends to suffer due to a
>>> lot of pipeline flushes from atomic operations.  The patch below
>>> removes the sock_hold() and sock_put() in unix_stream_sendmsg().  This
>>> should be safe as the socket still holds a reference to its peer which
>>> is only released after the file descriptor's final user invokes
>>> unix_release_sock().  The only consideration is that we must add a
>>> memory barrier before setting the peer initially.
>>>
>>> Signed-off-by: Benjamin LaHaise 
>>> Signed-off-by: David S. Miller 
>>
>> JFTR: This seems to be unrelated. (As far as I understand this), the
>> problem is that sk_wake_async accesses sk->sk_socket. That's invoked via
>> the
>>
>> other->sk_data_ready(other)
>>
>> in unix_stream_sendmsg after an
>>
>> unix_state_unlock(other);
>>
>> because of this, it can race with the code in unix_release_sock clearing
>> this pointer (via sock_orphan). The structure this pointer points to is
>> freed via iput in sock_release (net/socket.c) after the af_unix release
>> routine returned (it's really one part of a "twin structure" with the
>> socket inode being the other).
>>
>> A quick way to test if this was true would be to swap the
>>
>> unix_state_unlock(other);
>> other->sk_data_ready(other);
>>
>> in unix_stream_sendmsg and in case it is, a very 'hacky' fix could be to
>> put a pointer to the socket inode into the struct unix_sock, do an iget
>> on that in unix_create1 and a corresponding iput in
>> unix_sock_destructor.
>
> This is interesting, but is not the problem or/and the fix.
>
> We are supposed to own a reference on the 'other' socket or make sure
> it cannot disappear under us.

The af_unix part of this, yes, ie, what gets allocated in
unix_create1. But neither the socket inode nor the struct sock
originally passed to unix_create. Since these are part of the same
umbrella structure, they'll both be freed as consequence of the
sock_release iput. As far as I can tell (I don't claim that I'm
necessarily right on this, this is just the result of spending ca 2h
reading the code with the problem report in mind and looking for
something which could cause it), doing a sock_hold on the unix peer of
the socket in unix_stream_sendmsg is indeed not needed, however, there's
no additional reference to the inode or the struct sock accompanying it,
ie, both of these will be freed by unix_release_sock. This also affects
unix_dgram_sendmsg.

It's also easy to verify: Swap the unix_state_lock and
other->sk_data_ready and see if the issue still occurs. Right now (this
may change after I had some sleep as it's pretty late for me), I don't
think there's another local fix: The ->sk_data_ready accesses a
pointer after the lock taken by the code which will clear and
then later free it was released.
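In code terms, the swap test described above is (illustrative pseudocode of the ordering, not a tested patch):

```c
/* Current (racy) ordering in unix_stream_sendmsg(): */
unix_state_unlock(other);
other->sk_data_ready(other);	/* 'other' may be orphaned and its socket
				 * inode freed by a concurrent
				 * unix_release_sock() once the lock drops */

/* Suggested swap for testing: run the callback under the lock. */
other->sk_data_ready(other);
unix_state_unlock(other);
```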


Re: net: Generalise wq_has_sleeper helper

2015-11-24 Thread Herbert Xu
On Tue, Nov 24, 2015 at 04:30:25PM -0500, David Miller wrote:
>
> I'm fine with wherever this patch goes.  Herbert is there any
> particular tree where it'll facilitate another user quickest?
> 
> Or should I just toss it into net-next?
> 
> Acked-by: David S. Miller 

No, Dave, net-next is fine I think.  This was prompted by Tatsukawa-san's
patches to fix waitqueue users affected by this very race and they
were all over the tree.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: use-after-free in sock_wake_async

2015-11-24 Thread Eric Dumazet
Dmitry, could you test the following patch with your setup?

( I tried to reproduce the error you reported but could not )

The inode can be freed (without an RCU grace period), but not the socket
or sk_wq.

By using sk_wq in the critical paths, we do not dereference the inode.



Thanks !

 include/linux/net.h |2 +-
 include/net/sock.h  |8 ++--
 net/core/stream.c   |2 +-
 net/sctp/socket.c   |6 +-
 net/socket.c|   16 +---
 5 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 70ac5e28e6b7..6b93ec234ce8 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -202,7 +202,7 @@ enum {
SOCK_WAKE_URG,
 };
 
-int sock_wake_async(struct socket *sk, int how, int band);
+int sock_wake_async(struct socket *sock, struct socket_wq *wq, int how, int band);
 int sock_register(const struct net_proto_family *fam);
 void sock_unregister(int family);
 int __sock_create(struct net *net, int family, int type, int proto,
diff --git a/include/net/sock.h b/include/net/sock.h
index 7f89e4ba18d1..af78f9e7a218 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2007,8 +2007,12 @@ static inline unsigned long sock_wspace(struct sock *sk)
 
 static inline void sk_wake_async(struct sock *sk, int how, int band)
 {
-   if (sock_flag(sk, SOCK_FASYNC))
-   sock_wake_async(sk->sk_socket, how, band);
+   if (sock_flag(sk, SOCK_FASYNC)) {
+   rcu_read_lock();
+   sock_wake_async(sk->sk_socket, rcu_dereference(sk->sk_wq),
+   how, band);
+   rcu_read_unlock();
+   }
 }
 
 /* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might
diff --git a/net/core/stream.c b/net/core/stream.c
index d70f77a0c889..92682228919d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -39,7 +39,7 @@ void sk_stream_write_space(struct sock *sk)
wake_up_interruptible_poll(&wq->wait, POLLOUT |
POLLWRNORM | POLLWRBAND);
if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
-   sock_wake_async(sock, SOCK_WAKE_SPACE, POLL_OUT);
+   sock_wake_async(sock, wq, SOCK_WAKE_SPACE, POLL_OUT);
rcu_read_unlock();
}
 }
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 897c01c029ca..6ab04866a1e7 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6817,9 +6817,13 @@ static void __sctp_write_space(struct sctp_association *asoc)
 * here by modeling from the current TCP/UDP code.
 * We have not tested with it yet.
 */
-   if (!(sk->sk_shutdown & SEND_SHUTDOWN))
+   if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
+   rcu_read_lock();
sock_wake_async(sock,
+   rcu_dereference(sk->sk_wq),
SOCK_WAKE_SPACE, POLL_OUT);
+   rcu_read_unlock();
+   }
}
}
 }
diff --git a/net/socket.c b/net/socket.c
index dd2c247c99e3..8df62c8bef90 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1058,18 +1058,12 @@ static int sock_fasync(int fd, struct file *filp, int on)
 
/* This function may be called only under socket lock or callback_lock or rcu_lock */
 
-int sock_wake_async(struct socket *sock, int how, int band)
+int sock_wake_async(struct socket *sock, struct socket_wq *wq,
+   int how, int band)
 {
-   struct socket_wq *wq;
-
-   if (!sock)
-   return -1;
-   rcu_read_lock();
-   wq = rcu_dereference(sock->wq);
-   if (!wq || !wq->fasync_list) {
-   rcu_read_unlock();
+   if (!sock || !wq || !wq->fasync_list)
return -1;
-   }
+
switch (how) {
case SOCK_WAKE_WAITD:
if (test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
@@ -1086,7 +1080,7 @@ call_kill:
case SOCK_WAKE_URG:
kill_fasync(&wq->fasync_list, SIGURG, band);
}
-   rcu_read_unlock();
+
return 0;
 }
 EXPORT_SYMBOL(sock_wake_async);




Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
On 2015-11-24 22:20, Alexander Duyck wrote:
> I'm still not a fan of this approach.  I really feel like this is
> something that should be resolved by extending the existing PCI hot-plug
> rather than trying to instrument this per driver.  Then you will get the
> goodness for multiple drivers and multiple OSes instead of just one.  An
> added advantage to dealing with this in the PCI hot-plug environment
> would be that you could then still do a hot-plug even if the guest
> didn't load a driver for the VF since you would be working with the PCI
> slot instead of the device itself.
> 
> - Alex

Hi Alex:
What you mentioned seems to be the bonding driver solution.
The paper "Live Migration with Pass-through Device for Linux VM" describes
it. It does a VF hotplug during migration. To maintain the network
connection while the VF is out, it takes advantage of the Linux bonding
driver to switch between the VF NIC and the emulated NIC. But it has side
effects: it requires the VM to do additional configuration, and performance
while switching between the two NICs is not good.

-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Alexander Duyck
On Tue, Nov 24, 2015 at 7:18 PM, Lan Tianyu  wrote:
> On 2015-11-24 22:20, Alexander Duyck wrote:
>> I'm still not a fan of this approach.  I really feel like this is
>> something that should be resolved by extending the existing PCI hot-plug
>> rather than trying to instrument this per driver.  Then you will get the
>> goodness for multiple drivers and multiple OSes instead of just one.  An
>> added advantage to dealing with this in the PCI hot-plug environment
>> would be that you could then still do a hot-plug even if the guest
>> didn't load a driver for the VF since you would be working with the PCI
>> slot instead of the device itself.
>>
>> - Alex
>
> Hi Alex:
> What you mentioned seems to be the bonding driver solution.
> The paper "Live Migration with Pass-through Device for Linux VM" describes
> it. It does a VF hotplug during migration. To maintain the network
> connection while the VF is out, it takes advantage of the Linux bonding
> driver to switch between the VF NIC and the emulated NIC. But it has side
> effects: it requires the VM to do additional configuration, and performance
> while switching between the two NICs is not good.

No, what I am getting at is that you can't go around and modify the
configuration space for every possible device out there.  This
solution won't scale.  If you instead moved the logic for notifying
the device into a separate mechanism such as making it a part of the
hot-plug logic then you only have to write the code once per OS in
order to get the hot-plug capability to pause/resume the device.  What
I am talking about is not full hot-plug, but rather to extend the
existing hot-plug in Qemu and the Linux kernel to support a
"pause/resume" functionality.  The PCI hot-plug specification calls
out the option of implementing something like this, but we don't
currently have support for it.

I just feel doing it through PCI hot-plug messages will scale much
better as you could likely make use of the power management
suspend/resume calls to take care of most of the needed implementation
details.

- Alex


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Alexander Duyck
On Tue, Nov 24, 2015 at 1:20 PM, Michael S. Tsirkin  wrote:
> On Tue, Nov 24, 2015 at 09:38:18PM +0800, Lan Tianyu wrote:
>> This patch is to add migration support for ixgbevf driver. Using
>> faked PCI migration capability table communicates with Qemu to
>> share migration status and mailbox irq vector index.
>>
>> Qemu will notify VF via sending MSIX msg to trigger mailbox
>> vector during migration and store migration status in the
>> PCI_VF_MIGRATION_VMM_STATUS regs in the new capability table.
>> The mailbox irq will be triggered just before the stop-and-copy stage
>> and after migration on the target machine.
>>
>> VF driver will put down net when detect migration and tell
>> Qemu it's ready for migration via writing PCI_VF_MIGRATION_VF_STATUS
>> reg. After migration, put up net again.
>>
>> Qemu will in charge of migrating PCI config space regs and MSIX config.
>>
>> The patch is dedicated to the normal case where net traffic works
>> when the mailbox irq is enabled. For other cases (such as the driver
>> not being loaded, or the adapter being suspended or closed), the mailbox
>> irq won't be triggered and the VF driver will disable it via the
>> PCI_VF_MIGRATION_CAP reg. These cases will be resolved later.
>>
>> Signed-off-by: Lan Tianyu 
>
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

The tracking of dirty pages still needs to be addressed unless the
interface is being downed before migration even starts which based on
other comments I am assuming is not the case.

I still feel that having a means of marking a page as being dirty when
it is unmapped would be the best way to go.  That way you only have to
update the DMA API instead of messing with each and every driver
trying to add code to force the page to be dirtied.
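The "dirty on unmap" idea could be sketched roughly as follows (a hypothetical wrapper for illustration; dma_unmap_page() and set_page_dirty_lock() are existing kernel APIs, but this helper is not):

```c
/* Hypothetical sketch only -- not an existing DMA API entry point.
 * After a device-to-memory transfer is unmapped, mark the backing
 * page dirty so migration's dirty-page tracking will copy it. */
static void dma_unmap_page_mark_dirty(struct device *dev, dma_addr_t addr,
				      size_t size, enum dma_data_direction dir,
				      struct page *page)
{
	dma_unmap_page(dev, addr, size, dir);
	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		set_page_dirty_lock(page);
}
```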

- Alex


Re: use-after-free in sock_wake_async

2015-11-24 Thread Eric Dumazet
On Tue, 2015-11-24 at 18:28 -0800, Eric Dumazet wrote:
> Dmitry, could you test following patch with your setup ?
> 
> ( I tried to reproduce the error you reported but could not )
> 
> Inode can be freed (without RCU grace period), but not the socket or
> sk_wq
> 
> By using sk_wq in the critical paths, we do not dereference the inode,
> 
> 

I finally was able to reproduce the warning (with more instances running
in parallel), and apparently this patch solves the problem.

> 
> Thanks !
> 
>  include/linux/net.h |2 +-
>  include/net/sock.h  |8 ++--
>  net/core/stream.c   |2 +-
>  net/sctp/socket.c   |6 +-
>  net/socket.c|   16 +---
>  5 files changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 70ac5e28e6b7..6b93ec234ce8 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -202,7 +202,7 @@ enum {
>   SOCK_WAKE_URG,
>  };
>  
> -int sock_wake_async(struct socket *sk, int how, int band);
> +int sock_wake_async(struct socket *sock, struct socket_wq *wq, int how, int band);
>  int sock_register(const struct net_proto_family *fam);
>  void sock_unregister(int family);
>  int __sock_create(struct net *net, int family, int type, int proto,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7f89e4ba18d1..af78f9e7a218 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2007,8 +2007,12 @@ static inline unsigned long sock_wspace(struct sock *sk)
>  
>  static inline void sk_wake_async(struct sock *sk, int how, int band)
>  {
> - if (sock_flag(sk, SOCK_FASYNC))
> - sock_wake_async(sk->sk_socket, how, band);
> + if (sock_flag(sk, SOCK_FASYNC)) {
> + rcu_read_lock();
> + sock_wake_async(sk->sk_socket, rcu_dereference(sk->sk_wq),
> + how, band);
> + rcu_read_unlock();
> + }
>  }
>  
>  /* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might
> diff --git a/net/core/stream.c b/net/core/stream.c
> index d70f77a0c889..92682228919d 100644
> --- a/net/core/stream.c
> +++ b/net/core/stream.c
> @@ -39,7 +39,7 @@ void sk_stream_write_space(struct sock *sk)
>   wake_up_interruptible_poll(&wq->wait, POLLOUT |
>   POLLWRNORM | POLLWRBAND);
>   if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
> - sock_wake_async(sock, SOCK_WAKE_SPACE, POLL_OUT);
> + sock_wake_async(sock, wq, SOCK_WAKE_SPACE, POLL_OUT);
>   rcu_read_unlock();
>   }
>  }
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 897c01c029ca..6ab04866a1e7 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -6817,9 +6817,13 @@ static void __sctp_write_space(struct sctp_association *asoc)
>* here by modeling from the current TCP/UDP code.
>* We have not tested with it yet.
>*/
> - if (!(sk->sk_shutdown & SEND_SHUTDOWN))
> + if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
> + rcu_read_lock();
>   sock_wake_async(sock,
> + rcu_dereference(sk->sk_wq),
>   SOCK_WAKE_SPACE, POLL_OUT);
> + rcu_read_unlock();
> + }
>   }
>   }
>  }
> diff --git a/net/socket.c b/net/socket.c
> index dd2c247c99e3..8df62c8bef90 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -1058,18 +1058,12 @@ static int sock_fasync(int fd, struct file *filp, int on)
>  
> /* This function may be called only under socket lock or callback_lock or rcu_lock */
>  
> -int sock_wake_async(struct socket *sock, int how, int band)
> +int sock_wake_async(struct socket *sock, struct socket_wq *wq,
> + int how, int band)
>  {
> - struct socket_wq *wq;
> -
> - if (!sock)
> - return -1;
> - rcu_read_lock();
> - wq = rcu_dereference(sock->wq);
> - if (!wq || !wq->fasync_list) {
> - rcu_read_unlock();
> + if (!sock || !wq || !wq->fasync_list)
>   return -1;
> - }
> +
>   switch (how) {
>   case SOCK_WAKE_WAITD:
>   if (test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
> @@ -1086,7 +1080,7 @@ call_kill:
>   case SOCK_WAKE_URG:
>   kill_fasync(&wq->fasync_list, SIGURG, band);
>   }
> - rcu_read_unlock();
> +
>   return 0;
>  }
>  EXPORT_SYMBOL(sock_wake_async);
> 




Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Lan Tianyu
On 2015-11-25 05:20, Michael S. Tsirkin wrote:
> I have to say, I was much more interested in the idea
> of tracking dirty memory. I have some thoughts about
> that one - did you give up on it then?

No, our final goal is to keep the VF active while the migration runs,
and tracking dirty memory is essential for that. But that seems hard
to get upstream in the short term, so as a starting point we stop the
VF before migration.

After more thought, the stop-the-VF approach still needs to track
DMA-dirtied memory, to make sure receive buffers filled just before
the VF is stopped are migrated. The easiest way to do that is a dummy
CPU write into the data buffer when a packet is received.
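The dummy-write idea can be sketched as follows (illustrative only, not code from the patch set): a plain CPU store into the freshly received buffer, since hypervisor dirty-page logging sees CPU writes but not device DMA.

```c
/* Illustrative sketch: rewrite one byte of a DMA-filled rx buffer so
 * the page is marked dirty by CPU write tracking before stop-and-copy. */
static inline void rx_buf_mark_dirty(void *data, unsigned int len)
{
	volatile u8 *p = data;

	if (len)
		p[0] = p[0];	/* dummy write of the existing byte */
}
```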


-- 
Best regards
Tianyu Lan


[PATCH net] isdn: Partially revert debug format string usage clean up

2015-11-24 Thread Christoph Biedl
Commit 35a4a57 ("isdn: clean up debug format string usage") introduced
a safeguard to avoid accidential format string interpolation of data
when calling debugl1 or HiSax_putstatus. This did however not take into
account VHiSax_putstatus (called by HiSax_putstatus) does *not* call
vsprintf if the head parameter is NULL - the format string is treated
as plain text then instead. As a result, the string "%s" is processed
literally, and the actual information is lost. This affects the isdnlog
userspace program which stopped logging information since that commit.

So revert the HiSax_putstatus invocations to the previous state.

Fixes: 35a4a5733b0a ("isdn: clean up debug format string usage")
Cc: Kees Cook 
Cc: Karsten Keil 
Signed-off-by: Christoph Biedl 
---
 drivers/isdn/hisax/config.c  | 2 +-
 drivers/isdn/hisax/hfc_pci.c | 2 +-
 drivers/isdn/hisax/hfc_sx.c  | 2 +-
 drivers/isdn/hisax/q931.c| 6 +++---
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/isdn/hisax/config.c b/drivers/isdn/hisax/config.c
index b33f53b..bf04d2a 100644
--- a/drivers/isdn/hisax/config.c
+++ b/drivers/isdn/hisax/config.c
@@ -1896,7 +1896,7 @@ static void EChannel_proc_rcv(struct hisax_d_if *d_if)
ptr--;
*ptr++ = '\n';
*ptr = 0;
-   HiSax_putstatus(cs, NULL, "%s", cs->dlog);
+   HiSax_putstatus(cs, NULL, cs->dlog);
} else
HiSax_putstatus(cs, "LogEcho: ",
"warning Frame too big (%d)",
diff --git a/drivers/isdn/hisax/hfc_pci.c b/drivers/isdn/hisax/hfc_pci.c
index 4a48255..90449e1 100644
--- a/drivers/isdn/hisax/hfc_pci.c
+++ b/drivers/isdn/hisax/hfc_pci.c
@@ -901,7 +901,7 @@ Begin:
ptr--;
*ptr++ = '\n';
*ptr = 0;
-   HiSax_putstatus(cs, NULL, "%s", cs->dlog);
+   HiSax_putstatus(cs, NULL, cs->dlog);
} else
HiSax_putstatus(cs, "LogEcho: ", "warning Frame too big (%d)", total - 3);
}
diff --git a/drivers/isdn/hisax/hfc_sx.c b/drivers/isdn/hisax/hfc_sx.c
index b1fad81..13b2151 100644
--- a/drivers/isdn/hisax/hfc_sx.c
+++ b/drivers/isdn/hisax/hfc_sx.c
@@ -674,7 +674,7 @@ receive_emsg(struct IsdnCardState *cs)
ptr--;
*ptr++ = '\n';
*ptr = 0;
-   HiSax_putstatus(cs, NULL, "%s", cs->dlog);
+   HiSax_putstatus(cs, NULL, cs->dlog);
} else
HiSax_putstatus(cs, "LogEcho: ", "warning Frame too big (%d)", skb->len);
}
diff --git a/drivers/isdn/hisax/q931.c b/drivers/isdn/hisax/q931.c
index b420f8b..ba4beb2 100644
--- a/drivers/isdn/hisax/q931.c
+++ b/drivers/isdn/hisax/q931.c
@@ -1179,7 +1179,7 @@ LogFrame(struct IsdnCardState *cs, u_char *buf, int size)
dp--;
*dp++ = '\n';
*dp = 0;
-   HiSax_putstatus(cs, NULL, "%s", cs->dlog);
+   HiSax_putstatus(cs, NULL, cs->dlog);
} else
HiSax_putstatus(cs, "LogFrame: ", "warning Frame too big (%d)", size);
 }
@@ -1246,7 +1246,7 @@ dlogframe(struct IsdnCardState *cs, struct sk_buff *skb, int dir)
}
if (finish) {
*dp = 0;
-   HiSax_putstatus(cs, NULL, "%s", cs->dlog);
+   HiSax_putstatus(cs, NULL, cs->dlog);
return;
}
if ((0xfe & buf[0]) == PROTO_DIS_N0) {  /* 1TR6 */
@@ -1509,5 +1509,5 @@ dlogframe(struct IsdnCardState *cs, struct sk_buff *skb, int dir)
dp += sprintf(dp, "Unknown protocol %x!", buf[0]);
}
*dp = 0;
-   HiSax_putstatus(cs, NULL, "%s", cs->dlog);
+   HiSax_putstatus(cs, NULL, cs->dlog);
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
Tom Herbert  wrote:
> On Tue, Nov 24, 2015 at 12:55 PM, Florian Westphal  wrote:
> > Why anyone would invest such a huge amount of work in making this
> > kernel-based framing for single-stream tcp record (de)mux rather than
> > improving the userspace protocol to use UDP or SCTP or at least
> > one tcp connection per worker is beyond me.
> >
> From the /0 patch:
> 
> Q: Why not use an existing message-oriented protocol such as RUDP,
>DCCP, SCTP, RDS, and others?
> 
> A: Because that would entail using a completely new transport protocol.

That's why I wrote 'or at least one tcp connection per worker'.

> > For TX side, why is writev not good enough?
> 
> writev on a TCP stream does not guarantee atomicity of the operation.

Are you talking about short writes?

> It writes atomic without user space needing to implement locking when
> a socket is shared amongst threads.

Yes, I get that point, but I maintain that KCM is a strange workaround
for bad userspace design.

1 tcp connection per thread -> no userspace sockfd lock needed

Sender side can use writev, sendmsg, sendmmsg, etc to avoid sending
sub-record sized frames.

Is user space really so bad that instead of fixing it it's simpler to
work around it with even more kernel bloat?

Since for KCM userspace has to be adjusted anyway I find that hard
to believe.

I don't know if the 'dynamic RCVLOWAT' that you want is needed
(you say 'yes'; Eric's reply seems to indicate it's not, at least assuming
 a sane/friendly peer that doesn't intentionally xmit byte-by-byte).

But assuming there would really be a benefit, maybe a RCVLOWAT2 could
be added?  Of course we could only make it a hint and would have to
make a blocking read return with less data than desired when the tcp rmem
limit gets hit.  But at least we'd avoid the 'unbounded allocation of a large
amount of kernel memory' problem that we have with the current proposal.

Thanks,
Florian


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Hannes Frederic Sowa
Hi David,

On Tue, Nov 24, 2015, at 23:25, David Miller wrote:
> From: Florian Westphal 
> Date: Tue, 24 Nov 2015 23:22:42 +0100
> 
> > Yes, I get that point, but I maintain that KCM is a strange workaround
> > for bad userspace design.
> 
> I fundamentally disagree with you.
> 
> And even if I didn't, I would be remiss to completely dismiss the
> difficulty in changing existing protocols and existing large scale
> implementations of them.  If we can facilitate them somehow then
> I see nothing wrong with that.
> 
> Neither you nor Hannes have made a strong enough argument for me
> to consider Tom's work not suitable for upstream.
> 
> Have you even looked at the example userspace use case he referenced
> and considered the constraints under which it operates?  I seriously
> doubt you did.

If you are referring to thrift and tls framing, yes indeed, I did. I
have experience with Google protocol buffers and once cared for an
in-house RPC implementation. What I learned is that this approach is
prone to starvation or to building up huge messages in kernel space. That is
why XML streaming in the form of StAX from the Java world is used more and
more, and even Apache Jackson provides a streaming API for JSON, which
I once used because JSON messages streamed as hash tables got too big
and were prone to starvation. Even user space needs to be careful what sizes
of messages it accepts, otherwise DoS attacks are possible and JVMs with
small heaps get OutOfMemoryErrors. The same goes for other
high-level languages; non-GCed languages (or those without a copying
garbage collector) just reallocate and cause fragmentation in the long term. Even
keeping multiple 16MB chunks for HTTP/2 in the kernel heap so that user
space can read them in one go seems very much a bad idea in my opinion.

None of those approaches delimits datagrams by "read() barriers".
I think the alternatives should be tried. I think this framework is only
applicable to a small fraction of RPC systems.

Thanks for following up, :)
Hannes


[PATCH net-next v5] mpls: support for dead routes

2015-11-24 Thread Roopa Prabhu
From: Roopa Prabhu 

Adds support for RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls
routes due to link events. Also adds code to ignore dead
routes during route selection.

Unlike ip routes, mpls routes are not deleted when the route goes
dead. This is current mpls behaviour and this patch does not change
that. With this patch, however, routes will be marked dead.
Dead routes are not notified to userspace (this is consistent with ipv4
routes).

dead routes:
---
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2  dev swp1
nexthop as to 700 via inet 10.1.1.6  dev swp2

$ip link set dev swp1 down

$ip link show dev swp1
4: swp1:  mtu 1500 qdisc pfifo_fast state DOWN mode
DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff

$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2  dev swp1 dead linkdown
nexthop as to 700 via inet 10.1.1.6  dev swp2

linkdown routes:

$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2  dev swp1
nexthop as to 700 via inet 10.1.1.6  dev swp2

$ip link show dev swp1
4: swp1:  mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff

/* carrier goes down */
$ip link show dev swp1
4: swp1:  mtu 1500 qdisc pfifo_fast
state DOWN mode DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff

$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2  dev swp1 linkdown
nexthop as to 700 via inet 10.1.1.6  dev swp2

Signed-off-by: Roopa Prabhu 
---

RFC to v1:
Addressed a few comments from Eric and Robert:
- remove support for weighted nexthops
- use rt_nhn_alive in the rt structure to keep a count of alive
routes.
What I have not done is sort nexthops on link events.
I am not comfortable recreating or sorting nexthops on
every carrier change. This leaves scope for optimizing in the
future.

v1 to v2:
Fix dead nexthop checks as suggested by Dave

v2 to v3:
Fix duplicated argument reported by kbuild test robot

v3 - v4:
- removed per-route rt_flags and derive it from the nh_flags during dumps
- use kmemdup to make a copy of the route during route updates
  due to link events

v4 - v5:
- if kmemdup fails, modify the original route in place. This is a
corner case and the only side effect is that in the remote case
of kmemdup failure, the changes will not be atomically visible
to the datapath.
- replace for_nexthops with change_nexthops in a bunch of places.
- fix indent


 net/mpls/af_mpls.c  | 250 
 net/mpls/internal.h |   2 +
 2 files changed, 215 insertions(+), 37 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c70d750..2248015 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -96,22 +96,15 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 }
 EXPORT_SYMBOL_GPL(mpls_pkt_too_big);
 
-static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt,
-struct sk_buff *skb, bool bos)
+static u32 mpls_multipath_hash(struct mpls_route *rt,
+  struct sk_buff *skb, bool bos)
 {
struct mpls_entry_decoded dec;
struct mpls_shim_hdr *hdr;
bool eli_seen = false;
int label_index;
-   int nh_index = 0;
u32 hash = 0;
 
-   /* No need to look further into packet if there's only
-* one path
-*/
-   if (rt->rt_nhn == 1)
-   goto out;
-
for (label_index = 0; label_index < MAX_MP_SELECT_LABELS && !bos;
 label_index++) {
if (!pskb_may_pull(skb, sizeof(*hdr) * label_index))
@@ -165,7 +158,37 @@ static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt,
}
}
 
-   nh_index = hash % rt->rt_nhn;
+   return hash;
+}
+
+static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt,
+struct sk_buff *skb, bool bos)
+{
+   u32 hash = 0;
+   int nh_index = 0;
+   int n = 0;
+
+   /* No need to look further into packet if there's only
+* one path
+*/
+   if (rt->rt_nhn == 1)
+   goto out;
+
+   if (rt->rt_nhn_alive <= 0)
+   return NULL;
+
+   hash = mpls_multipath_hash(rt, skb, bos);
+   nh_index = hash % rt->rt_nhn_alive;
+   if (rt->rt_nhn_alive == rt->rt_nhn)
+   goto out;
+   for_nexthops(rt) {
+   if (nh->nh_flags & (RTNH_F_DEAD | RTNH_F_LINKDOWN))
+   continue;
+   if (n == nh_index)
+   

Re: use-after-free in sock_wake_async

2015-11-24 Thread Eric Dumazet
On Tue, Nov 24, 2015 at 2:03 PM, Eric Dumazet  wrote:

>
> This might be a data race in sk_wake_async() if inlined by compiler
> (see https://lkml.org/lkml/2015/11/24/680 for another example)
>
> KASAN adds register pressure and compiler can then do 'stupid' things :(
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7f89e4ba18d1..2af6222ccc67 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2008,7 +2008,7 @@ static inline unsigned long sock_wspace(struct sock *sk)
>  static inline void sk_wake_async(struct sock *sk, int how, int band)
>  {
> if (sock_flag(sk, SOCK_FASYNC))
> -   sock_wake_async(sk->sk_socket, how, band);
> +   sock_wake_async(READ_ONCE(sk->sk_socket), how, band);
>  }
>
>  /* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might

Oh well, sock_wake_async() can not be inlined, scratch this.


[PATCH net-next] net: phy: bcm7xxx: Add entry for Broadcom BCM7435

2015-11-24 Thread Florian Fainelli
Add a PHY entry for the Broadcom BCM7435 chips. This is a 40nm
generation Ethernet PHY which is analogous to its 7425 and 7429
counterparts.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/bcm7xxx.c | 14 ++
 include/linux/brcmphy.h   |  1 +
 2 files changed, 15 insertions(+)

diff --git a/drivers/net/phy/bcm7xxx.c b/drivers/net/phy/bcm7xxx.c
index 03d4809a9126..d4083c381cd1 100644
--- a/drivers/net/phy/bcm7xxx.c
+++ b/drivers/net/phy/bcm7xxx.c
@@ -361,6 +361,19 @@ static struct phy_driver bcm7xxx_driver[] = {
.resume = bcm7xxx_config_init,
.driver = { .owner = THIS_MODULE },
 }, {
+   .phy_id = PHY_ID_BCM7435,
.phy_id_mask= 0xfffffff0,
+   .name   = "Broadcom BCM7435",
+   .features   = PHY_GBIT_FEATURES |
+ SUPPORTED_Pause | SUPPORTED_Asym_Pause,
+   .flags  = PHY_IS_INTERNAL,
+   .config_init= bcm7xxx_config_init,
+   .config_aneg= genphy_config_aneg,
+   .read_status= genphy_read_status,
+   .suspend= bcm7xxx_suspend,
+   .resume = bcm7xxx_config_init,
+   .driver = { .owner = THIS_MODULE },
+}, {
.phy_id = PHY_BCM_OUI_4,
.phy_id_mask= 0xffffffff,
.name   = "Broadcom BCM7XXX 40nm",
@@ -395,6 +408,7 @@ static struct mdio_device_id __maybe_unused bcm7xxx_tbl[] = {
{ PHY_ID_BCM7425, 0xfffffff0, },
{ PHY_ID_BCM7429, 0xfffffff0, },
{ PHY_ID_BCM7439, 0xfffffff0, },
+   { PHY_ID_BCM7435, 0xfffffff0, },
{ PHY_ID_BCM7445, 0xfffffff0, },
{ PHY_BCM_OUI_4, 0xffffffff },
{ PHY_BCM_OUI_5, 0xffffff00 },
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index 59f4a7304419..f0ba9c2ec639 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -26,6 +26,7 @@
 #define PHY_ID_BCM7366 0x600d8490
 #define PHY_ID_BCM7425 0x600d86b0
 #define PHY_ID_BCM7429 0x600d8730
+#define PHY_ID_BCM7435 0x600d8750
 #define PHY_ID_BCM7439 0x600d8480
 #define PHY_ID_BCM7439_2   0xae025080
 #define PHY_ID_BCM7445 0x600d8510
-- 
2.1.0



[PATCH net-next] MAINTAINERS: PHY: Change maintainer to reviewer

2015-11-24 Thread Florian Fainelli
Now that there is a reviewer role, add myself as reviewer since the PHY
library code is maintained via the networking tree.

Signed-off-by: Florian Fainelli 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index ea1751283b49..950c321eef73 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4195,7 +4195,7 @@ F:include/linux/netfilter_bridge/
 F: net/bridge/
 
 ETHERNET PHY LIBRARY
-M: Florian Fainelli 
+R: Florian Fainelli 
 L: netdev@vger.kernel.org
 S: Maintained
 F: include/linux/phy.h
-- 
2.1.0



bridge-utils: wrong sysfs path odds

2015-11-24 Thread Richard Weinberger
Hi!

Today I was hunting down an issue where "brctl stp br0 off"
always failed on mips64be with n32 userland.

It turned out that the SIOCDEVPRIVATE ioctl() with
BRCTL_SET_BRIDGE_STP_STATE
returned -EOPNOTSUPP.
First I thought this was a plain ABI issue on mips, as in old_dev_ioctl()
the ioctl() argument was 0x1 instead of the expected BRCTL_SET_BRIDGE_STP_STATE
(0x14).

Further investigation showed that brctl first tries to open the sysfs file
"/sys/class/net/br0/stp_state" and falls back to the legacy ioctl() upon 
failure.

On my mips setup old_dev_ioctl() seems not to work. And the function's comment
is correct:
/*
 * Legacy ioctl's through SIOCDEVPRIVATE
 * This interface is deprecated because it was too difficult to
 * to do the translation for 32/64bit ioctl compatibility.
 */

Later I realized that the sysfs path is wrong: the "bridge/" directory
part is missing.
On most setups nobody would notice as the fallback ioctl() works.

Debian's bridge-utils package carries a patch which fixes the sysfs paths.
Can we please have this patch also in upstream bridge-utils?

Thanks,
//richard


Re: [RFC PATCH V2 3/3] Ixgbevf: Add migration support for ixgbevf driver

2015-11-24 Thread Michael S. Tsirkin
On Tue, Nov 24, 2015 at 09:38:18PM +0800, Lan Tianyu wrote:
> This patch is to add migration support for ixgbevf driver. Using
> faked PCI migration capability table communicates with Qemu to
> share migration status and mailbox irq vector index.
> 
> Qemu will notify VF via sending MSIX msg to trigger mailbox
> vector during migration and store migration status in the
> PCI_VF_MIGRATION_VMM_STATUS regs in the new capability table.
> The mailbox irq will be triggered just before stop-and-copy stage
> and after migration on the target machine.
> 
> VF driver will put down net when detect migration and tell
> Qemu it's ready for migration via writing PCI_VF_MIGRATION_VF_STATUS
> reg. After migration, put up net again.
> 
> Qemu will in charge of migrating PCI config space regs and MSIX config.
> 
> The patch is dedicated to the normal case where net traffic works
> when mailbox irq is enabled. For other cases (such as the driver
> not being loaded, or the adapter being suspended or closed), mailbox
> irq won't be triggered and VF driver will disable it via
> PCI_VF_MIGRATION_CAP reg. These cases will be resolved later.
> 
> Signed-off-by: Lan Tianyu 

I have to say, I was much more interested in the idea
of tracking dirty memory. I have some thoughts about
that one - did you give up on it then?



> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 
> ++
>  2 files changed, 107 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 775d089..4b8ba2f 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -438,6 +438,11 @@ struct ixgbevf_adapter {
>   u64 bp_tx_missed;
>  #endif
>  
> + u8 migration_cap;
> + u8 last_migration_reg;
> + unsigned long migration_status;
> + struct work_struct migration_task;
> +
>   u8 __iomem *io_addr; /* Mainly for iounmap use */
>   u32 link_speed;
>   bool link_up;
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index a16d267..95860c2 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -96,6 +96,8 @@ static int debug = -1;
>  module_param(debug, int, 0);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>  
> +#define MIGRATION_IN_PROGRESS0
> +
>  static void ixgbevf_service_event_schedule(struct ixgbevf_adapter *adapter)
>  {
> if (!test_bit(__IXGBEVF_DOWN, &adapter->state) &&
> @@ -1262,6 +1264,22 @@ static void ixgbevf_set_itr(struct ixgbevf_q_vector *q_vector)
>   }
>  }
>  
> +static void ixgbevf_migration_check(struct ixgbevf_adapter *adapter) 
> +{
> + struct pci_dev *pdev = adapter->pdev;
> + u8 val;
> +
> + pci_read_config_byte(pdev,
> +  adapter->migration_cap + PCI_VF_MIGRATION_VMM_STATUS,
> +  &val);
> +
> + if (val != adapter->last_migration_reg) {
> + schedule_work(&adapter->migration_task);
> + adapter->last_migration_reg = val;
> + }
> +
> +}
> +
>  static irqreturn_t ixgbevf_msix_other(int irq, void *data)
>  {
>   struct ixgbevf_adapter *adapter = data;
> @@ -1269,6 +1287,7 @@ static irqreturn_t ixgbevf_msix_other(int irq, void *data)
>  
>   hw->mac.get_link_status = 1;
>  
> + ixgbevf_migration_check(adapter);
>   ixgbevf_service_event_schedule(adapter);
>  
>   IXGBE_WRITE_REG(hw, IXGBE_VTEIMS, adapter->eims_other);
> @@ -1383,6 +1402,7 @@ out:
>  static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
>  {
>   struct net_device *netdev = adapter->netdev;
> + struct pci_dev *pdev = adapter->pdev;
>   int q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
>   int vector, err;
>   int ri = 0, ti = 0;
> @@ -1423,6 +1443,12 @@ static int ixgbevf_request_msix_irqs(struct ixgbevf_adapter *adapter)
>   goto free_queue_irqs;
>   }
>  
> + if (adapter->migration_cap) {
> + pci_write_config_byte(pdev,
> + adapter->migration_cap + PCI_VF_MIGRATION_IRQ,
> + vector);
> + }
> +
>   return 0;
>  
>  free_queue_irqs:
> @@ -2891,6 +2917,59 @@ static void ixgbevf_watchdog_subtask(struct ixgbevf_adapter *adapter)
>   ixgbevf_update_stats(adapter);
>  }
>  
> +static void ixgbevf_migration_task(struct work_struct *work)
> +{
> + struct ixgbevf_adapter *adapter = container_of(work,
> + struct ixgbevf_adapter,
> + migration_task);
> + struct pci_dev *pdev = adapter->pdev;
> + struct net_device *netdev = adapter->netdev;
> + u8 val;
> +
> + if (!test_bit(MIGRATION_IN_PROGRESS, &adapter->migration_status)) {
> + pci_read_config_byte(pdev,
> + 

Re: use-after-free in sock_wake_async

2015-11-24 Thread Jason Baron


On 11/24/2015 10:21 AM, Eric Dumazet wrote:
> On Tue, Nov 24, 2015 at 6:18 AM, Dmitry Vyukov  wrote:
>> Hello,
>>
>> The following program triggers use-after-free in sock_wake_async:
>>
>> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> long r2 = -1;
>> long r3 = -1;
>> long r7 = -1;
>>
>> void *thr0(void *arg)
>> {
>> syscall(SYS_splice, r2, 0x0ul, r7, 0x0ul, 0x4ul, 0x8ul);
>> return 0;
>> }
>>
>> void *thr1(void *arg)
>> {
>> syscall(SYS_close, r2, 0, 0, 0, 0, 0);
>> return 0;
>> }
>>
>> void *thr2(void *arg)
>> {
>> syscall(SYS_write, r3, 0x20003000ul, 0xe7ul, 0, 0, 0);
>> return 0;
>> }
>>
>> int main()
>> {
long r0 = syscall(SYS_mmap, 0x2000ul, 0x1ul, 0x3ul,
0x32ul, 0xfffffffffffffffful, 0x0ul);
>> long r1 = syscall(SYS_socketpair, 0x1ul, 0x1ul, 0x0ul,
>> 0x2000ul, 0, 0);
>> r2 = *(uint32_t*)0x2000;
>> r3 = *(uint32_t*)0x2004;
>>
>> *(uint64_t*)0x20001000 = 0x4;
>> long r5 = syscall(SYS_ioctl, r2, 0x5452ul, 0x20001000ul, 0, 0, 0);
>>
>> long r6 = syscall(SYS_pipe2, 0x20002000ul, 0x80800ul, 0, 0, 0, 0);
>> r7 = *(uint32_t*)0x20002004;
>>
>> pthread_t th[3];
pthread_create(&th[0], 0, thr0, 0);
pthread_create(&th[1], 0, thr1, 0);
pthread_create(&th[2], 0, thr2, 0);
>> pthread_join(th[0], 0);
>> pthread_join(th[1], 0);
>> pthread_join(th[2], 0);
>> return 0;
>> }
>>
>>
>> The use-after-free fires after a minute of running it in a tight
>> parallel loop. I use the stress utility for this:
>>
>> $ go get golang.org/x/tools/cmd/stress
>> $ stress -p 128 -failure "ignore" ./a.out
>>
>>
>> ==
>> BUG: KASAN: use-after-free in sock_wake_async+0x325/0x340 at addr
>> ffff880061d1ad10
>> Read of size 8 by task a.out/23178
>> =
>> BUG sock_inode_cache (Not tainted): kasan: bad access detected
>> -
>>
>> Disabling lock debugging due to kernel taint
>> INFO: Allocated in sock_alloc_inode+0x1d/0x220 age=0 cpu=2 pid=23183
>> [<  none  >] kmem_cache_alloc+0x1a6/0x1f0 mm/slub.c:2514
>> [<  none  >] sock_alloc_inode+0x1d/0x220 net/socket.c:250
>> [<  none  >] alloc_inode+0x61/0x180 fs/inode.c:198
>> [<  none  >] new_inode_pseudo+0x17/0xe0 fs/inode.c:878
>> [<  none  >] sock_alloc+0x3d/0x260 net/socket.c:540
>> [<  none  >] __sock_create+0xa7/0x620 net/socket.c:1133
>> [< inline >] sock_create net/socket.c:1209
>> [< inline >] SYSC_socketpair net/socket.c:1281
>> [<  none  >] SyS_socketpair+0x112/0x4e0 net/socket.c:1260
>> [<  none  >] entry_SYSCALL_64_fastpath+0x16/0x7a
>> arch/x86/entry/entry_64.S:185
>>
>> INFO: Freed in sock_destroy_inode+0x56/0x70 age=0 cpu=2 pid=23185
>> [<  none  >] kmem_cache_free+0x24e/0x260 mm/slub.c:2742
>> [<  none  >] sock_destroy_inode+0x56/0x70 net/socket.c:279
>> [<  none  >] destroy_inode+0xc4/0x120 fs/inode.c:255
>> [<  none  >] evict+0x36b/0x580 fs/inode.c:559
>> [< inline >] iput_final fs/inode.c:1477
>> [<  none  >] iput+0x4a0/0x790 fs/inode.c:1504
>> [< inline >] dentry_iput fs/dcache.c:358
>> [<  none  >] __dentry_kill+0x4fe/0x700 fs/dcache.c:543
>> [< inline >] dentry_kill fs/dcache.c:587
>> [<  none  >] dput+0x6ab/0x7a0 fs/dcache.c:796
>> [<  none  >] __fput+0x3fb/0x6e0 fs/file_table.c:226
>> [<  none  >] fput+0x15/0x20 fs/file_table.c:244
>> [<  none  >] task_work_run+0x163/0x1f0 kernel/task_work.c:115
>> (discriminator 1)
>> [< inline >] tracehook_notify_resume include/linux/tracehook.h:191
>> [<  none  >] exit_to_usermode_loop+0x180/0x1a0
>> arch/x86/entry/common.c:251
>> [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282
>> [<  none  >] syscall_return_slowpath+0x19f/0x210
>> arch/x86/entry/common.c:344
>> [<  none  >] int_ret_from_sys_call+0x25/0x9f
>> arch/x86/entry/entry_64.S:281
>>
>> INFO: Slab 0xffffea0001874600 objects=25 used=2 fp=0xffff880061d1c100
>> flags=0x5004080
>> INFO: Object 0xffff880061d1ad00 @offset=11520 fp=0xffff880061d1a300
>> CPU: 3 PID: 23178 Comm: a.out Tainted: GB   4.4.0-rc1+ #84
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>   880061baf8f0 825d3336 88003e0dc280
>>  880061d1ad00 880061d18000 880061baf920 81618784
>>  88003e0dc280 ea0001874600 880061d1ad00 00e7
>>
>> Call Trace:
>>  [] __asan_report_load8_noabort+0x3e/0x40
>> mm/kasan/report.c:280
>>  [< inline >] 

[PATCH 09/13] mm: memcontrol: generalize the socket accounting jump label

2015-11-24 Thread Johannes Weiner
The unified hierarchy memory controller is going to use this jump
label as well to control the networking callbacks. Move it to the
memory controller code and give it a more generic name.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Reviewed-by: Vladimir Davydov 
---
 include/linux/memcontrol.h | 4 
 include/net/sock.h | 7 ---
 mm/memcontrol.c| 3 +++
 net/core/sock.c| 5 -
 net/ipv4/tcp_memcontrol.c  | 4 ++--
 5 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99fefe..dad56ef 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -681,6 +681,8 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 struct sock;
+extern struct static_key memcg_sockets_enabled_key;
+#define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
@@ -689,6 +691,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
return memcg->tcp_mem.memory_pressure;
 }
+#else
+#define mem_cgroup_sockets_enabled 0
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 1a94b85..fcc9442 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1065,13 +1065,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
-extern struct static_key memcg_socket_limit_enabled;
-#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
-#else
-#define mem_cgroup_sockets_enabled 0
-#endif
-
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
if (sk->sk_wmem_queued >= sk->sk_sndbuf)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 68d67fc..0602bee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,6 +291,9 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
+struct static_key memcg_sockets_enabled_key;
+EXPORT_SYMBOL(memcg_sockets_enabled_key);
+
 void sock_update_memcg(struct sock *sk)
 {
struct mem_cgroup *memcg;
diff --git a/net/core/sock.c b/net/core/sock.c
index 6486b0d..c5435b5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -201,11 +201,6 @@ EXPORT_SYMBOL(sk_net_capable);
 static struct lock_class_key af_family_keys[AF_MAX];
 static struct lock_class_key af_family_slock_keys[AF_MAX];
 
-#if defined(CONFIG_MEMCG_KMEM)
-struct static_key memcg_socket_limit_enabled;
-EXPORT_SYMBOL(memcg_socket_limit_enabled);
-#endif
-
 /*
  * Make lock validator output more readable. (we pre-construct these
  * strings build-time, so that runtime initialization of socket
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index e507825..9a22e2d 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -34,7 +34,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
return;
 
if (memcg->tcp_mem.active)
-   static_key_slow_dec(&memcg_socket_limit_enabled);
+   static_key_slow_dec(&memcg_sockets_enabled_key);
 }
 
 static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
@@ -65,7 +65,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 * because when this value change, the code to process it is not
 * patched in yet.
 */
-   static_key_slow_inc(&memcg_socket_limit_enabled);
+   static_key_slow_inc(&memcg_sockets_enabled_key);
memcg->tcp_mem.active = true;
}
 
-- 
2.6.2



[PATCH 08/13] net: tcp_memcontrol: simplify linkage between socket and page counter

2015-11-24 Thread Johannes Weiner
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future. Remove the indirection and
link sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner 
Reviewed-by: Vladimir Davydov 
---
 include/linux/memcontrol.h   | 18 +++-
 include/net/sock.h   | 36 +++-
 include/net/tcp.h|  4 +--
 include/net/tcp_memcontrol.h |  1 -
 mm/memcontrol.c  | 57 +++--
 net/core/sock.c  | 52 +-
 net/ipv4/tcp_ipv4.c  |  7 +
 net/ipv4/tcp_memcontrol.c| 67 +---
 net/ipv4/tcp_output.c|  4 +--
 net/ipv6/tcp_ipv6.c  |  3 --
 10 files changed, 68 insertions(+), 181 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d80021..d99fefe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -89,16 +89,6 @@ struct cg_proto {
struct page_counter memory_allocated;   /* Current allocated 
memory. */
int memory_pressure;
boolactive;
-   /*
-* memcg field is used to find which memcg we belong directly
-* Each memcg struct can hold more than one cg_proto, so container_of
-* won't really cut.
-*
-* The elegant solution would be having an inverse function to
-* proto_cgroup in struct proto, but that means polluting the structure
-* for everybody, instead of just for memcg users.
-*/
-   struct mem_cgroup   *memcg;
 };
 
 #ifdef CONFIG_MEMCG
@@ -693,11 +683,11 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
-bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
-void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
-static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
-   return proto->memory_pressure;
+   return memcg->tcp_mem.memory_pressure;
 }
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 888aa3f..1a94b85 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -69,22 +69,6 @@
 #include 
 #include 
 
-struct cgroup;
-struct cgroup_subsys;
-#ifdef CONFIG_NET
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys 
*ss);
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
-#else
-static inline
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-   return 0;
-}
-static inline
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-}
-#endif
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -310,7 +294,7 @@ struct cg_proto;
   *@sk_security: used by security modules
   *@sk_mark: generic packet mark
   *@sk_classid: this socket's cgroup classid
-  *@sk_cgrp: this socket's cgroup-specific proto data
+  *@sk_memcg: this socket's memory cgroup association
   *@sk_write_pending: a write to stream socket waits to start
   *@sk_state_change: callback to indicate change in the state of the sock
   *@sk_data_ready: callback to indicate there is data to be processed
@@ -447,7 +431,7 @@ struct sock {
 #ifdef CONFIG_CGROUP_NET_CLASSID
u32 sk_classid;
 #endif
-   struct cg_proto *sk_cgrp;
+   struct mem_cgroup   *sk_memcg;
void(*sk_state_change)(struct sock *sk);
void(*sk_data_ready)(struct sock *sk);
void(*sk_write_space)(struct sock *sk);
@@ -1051,18 +1035,6 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
atomic_tsocks;
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-   /*
-* cgroup specific init/deinit functions. Called once for all
-* protocols that implement it, from cgroups populate function.
-* This function has to setup any files the protocol want to
-* appear in the kmem cgroup filesystem.
-*/
-   int (*init_cgroup)(struct mem_cgroup *memcg,
-  struct cgroup_subsys *ss);
-   void(*destroy_cgroup)(struct mem_cgroup *memcg);
-   struct cg_proto *(*proto_cgroup)(struct mem_cgroup *memcg);
-#endif
 };
 
 int proto_register(struct proto *prot, int alloc_slab);
@@ -1126,8 +1098,8 @@ static inline 

[PATCH 11/13] mm: memcontrol: move socket code for unified hierarchy accounting

2015-11-24 Thread Johannes Weiner
The unified hierarchy memory controller will account socket
memory. Move the infrastructure functions accordingly.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Reviewed-by: Vladimir Davydov 
---
 mm/memcontrol.c | 148 
 1 file changed, 74 insertions(+), 74 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6b8c0f7..ed030b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,80 +294,6 @@ static inline struct mem_cgroup 
*mem_cgroup_from_id(unsigned short id)
return mem_cgroup_from_css(css);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
-
-struct static_key memcg_sockets_enabled_key;
-EXPORT_SYMBOL(memcg_sockets_enabled_key);
-
-void sock_update_memcg(struct sock *sk)
-{
-   struct mem_cgroup *memcg;
-
-   /* Socket cloning can throw us here with sk_cgrp already
-* filled. It won't however, necessarily happen from
-* process context. So the test for root memcg given
-* the current task's memcg won't help us in this case.
-*
-* Respecting the original socket's memcg is a better
-* decision in this case.
-*/
-   if (sk->sk_memcg) {
-   BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
-   css_get(&sk->sk_memcg->css);
-   return;
-   }
-
-   rcu_read_lock();
-   memcg = mem_cgroup_from_task(current);
-   if (memcg != root_mem_cgroup &&
-   memcg->tcp_mem.active &&
-   css_tryget_online(&memcg->css))
-   sk->sk_memcg = memcg;
-   rcu_read_unlock();
-}
-EXPORT_SYMBOL(sock_update_memcg);
-
-void sock_release_memcg(struct sock *sk)
-{
-   WARN_ON(!sk->sk_memcg);
-   css_put(&sk->sk_memcg->css);
-}
-
-/**
- * mem_cgroup_charge_skmem - charge socket memory
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- *
- * Charges @nr_pages to @memcg. Returns %true if the charge fit within
- * @memcg's configured limit, %false if the charge had to be forced.
- */
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-   struct page_counter *counter;
-
-   if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
-   nr_pages, &counter)) {
-   memcg->tcp_mem.memory_pressure = 0;
-   return true;
-   }
-   page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
-   memcg->tcp_mem.memory_pressure = 1;
-   return false;
-}
-
-/**
- * mem_cgroup_uncharge_skmem - uncharge socket memory
- * @memcg - memcg to uncharge
- * @nr_pages - number of pages to uncharge
- */
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-   page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
-}
-
-#endif
-
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
@@ -5544,6 +5470,80 @@ void mem_cgroup_replace_page(struct page *oldpage, 
struct page *newpage)
commit_charge(newpage, memcg, true);
 }
 
+/* Writing them here to avoid exposing memcg's inner layout */
+#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+
+struct static_key memcg_sockets_enabled_key;
+EXPORT_SYMBOL(memcg_sockets_enabled_key);
+
+void sock_update_memcg(struct sock *sk)
+{
+   struct mem_cgroup *memcg;
+
+   /* Socket cloning can throw us here with sk_cgrp already
+* filled. It won't however, necessarily happen from
+* process context. So the test for root memcg given
+* the current task's memcg won't help us in this case.
+*
+* Respecting the original socket's memcg is a better
+* decision in this case.
+*/
+   if (sk->sk_memcg) {
+   BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+   css_get(&sk->sk_memcg->css);
+   return;
+   }
+
+   rcu_read_lock();
+   memcg = mem_cgroup_from_task(current);
+   if (memcg != root_mem_cgroup &&
+   memcg->tcp_mem.active &&
+   css_tryget_online(&memcg->css))
+   sk->sk_memcg = memcg;
+   rcu_read_unlock();
+}
+EXPORT_SYMBOL(sock_update_memcg);
+
+void sock_release_memcg(struct sock *sk)
+{
+   WARN_ON(!sk->sk_memcg);
+   css_put(&sk->sk_memcg->css);
+}
+
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+   struct page_counter *counter;
+
+   if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
+   nr_pages, &counter)) {
+   memcg->tcp_mem.memory_pressure = 0;
+  

Re: pull request: bluetooth-next 2015-11-23

2015-11-24 Thread David Miller
From: Johan Hedberg 
Date: Mon, 23 Nov 2015 15:55:33 +0200

> Here's the first bluetooth-next pull request for the 4.5 kernel.
> 
>  - Add new Get Advertising Size Information management command
>  - Add support for new system note message type on monitor channel
>  - Refactor LE scan changes behind separate workqueue to avoid races
>  - Fix issue with privacy feature when powering on adapter
>  - Various minor fixes & cleanups here and there
> 
> Please let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/4] sh_eth: Remove obsolete platform_device_id entries

2015-11-24 Thread David Miller
From: Geert Uytterhoeven 
Date: Tue, 24 Nov 2015 15:40:56 +0100

> Since commit 3d7608e4c169af03 ("ARM: shmobile: bockw: remove legacy
> board file and config"), which is in v4.4-rc1, shmobile SoCs are only
> supported in generic DT-only ARM multi-platform builds.  The sh_eth
> driver doesn't need to match platform devices by name anymore, hence
> this series removes the corresponding platform_device_id entries.
> 
> Changes since v2:
>   - More Acks,
>   - Platform dependency has entered mainline,
> 
> Changes since v1:
>   - Protect some data and functions by #ifdef CONFIG_OF to silence
> unused compiler warnings on SH,
>   - New patches 3 and 4.
> 
> Thanks for applying!

Series applied to net-next, thanks!


Re: Increasing skb->mark size

2015-11-24 Thread Matt Bennett
On Tue, 2015-11-24 at 21:36 +0100, Florian Westphal wrote:
> Matt Bennett  wrote:
> > I'm emailing this list for feedback on the feasibility of increasing
> > skb->mark or adding a new field for marking. Perhaps this extension
> > could be done under a new CONFIG option. Perhaps there are other ways we
> > could achieve the desired behaviour?
> 
> Well I pointed you towards connlabels which provide 128 bit of space
> in the conntrack extension area but you did not tell me why you cannot
> use it.
Sorry, I moved the discussion to this list to hopefully gather some new
ideas/opinions.

While connlabels provide 128 bits of space, skb->mark is still only 32
bits. Since we are using connection tracking simply to restore skb->mark,
the use of connlabels by itself doesn't solve the problem I outlined
above: skb->mark would still need to be increased in size.
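As a rough sketch of the mismatch, with invented struct layouts rather than the kernel's real conntrack or skb definitions: a connlabel area holds 128 bits, but restoring into a 32-bit skb->mark can only ever carry one word of it.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: layouts and names are invented for illustration,
 * they are not the kernel's conntrack or skb definitions. */
struct conn { uint32_t label[4]; };	/* 128-bit connlabel area */
struct skb  { uint32_t mark; };		/* fixed 32-bit mark */

static void restore_mark(struct skb *skb, const struct conn *ct)
{
	/* Only one of the four label words fits; the rest is lost. */
	skb->mark = ct->label[0];
}
```

However large the per-connection label area grows, the restore step stays bounded by the size of the mark field itself.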



Re: [PATCH] drivers: net: xgene: fix: ifconfig up/down crash

2015-11-24 Thread David Miller
From: Iyappan Subramanian 
Date: Mon, 23 Nov 2015 12:04:52 -0800

> Fixing kernel crash when doing ifconfig down and up in a loop,
 ...
> The fix was to reorder napi_enable, napi_disable, request_irq and
> free_irq calls, move register_netdev after dma_coerce_mask_and_coherent.
> 
> Signed-off-by: Iyappan Subramanian 
> Tested-by: Khuong Dinh 

Applied, thanks.


Re: [PATCH net] ipv6: distinguish frag queues by device for multicast and link-local packets

2015-11-24 Thread David Miller
From: Michal Kubecek 
Date: Tue, 24 Nov 2015 15:07:11 +0100 (CET)

> If a fragmented multicast packet is received on an ethernet device which
> has an active macvlan on top of it, each fragment is duplicated and
> received both on the underlying device and the macvlan. If some
> fragments for macvlan are processed before the whole packet for the
> underlying device is reassembled, the "overlapping fragments" test in
> ip6_frag_queue() discards the whole fragment queue.
> 
> To resolve this, add device ifindex to the search key and require it to
> match reassembling multicast packets and packets to link-local
> addresses.
> 
> Note: similar patch has been already submitted by Yoshifuji Hideaki in
> 
>   http://patchwork.ozlabs.org/patch/220979/
> 
> but got lost and forgotten for some reason.
> 
> Signed-off-by: Michal Kubecek 

This is definitely the right thing to do and matches how ipv4 keys
fragments.

Applied and queued up for -stable, thanks!
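The effect of the keying change can be sketched in miniature; the struct layout and names below are invented for illustration, not the kernel's actual frag-queue code:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative model of the fix described above: the reassembly lookup
 * key includes the incoming interface index, so identical fragments
 * seen on the underlying device and on the macvlan land in separate
 * queues instead of colliding as "overlapping fragments". */
struct frag_key {
	unsigned char saddr[16], daddr[16];
	unsigned int id;	/* fragment identification */
	int iif;		/* incoming device index, part of the key */
};

static bool key_match(const struct frag_key *a, const struct frag_key *b)
{
	return a->id == b->id &&
	       !memcmp(a->saddr, b->saddr, 16) &&
	       !memcmp(a->daddr, b->daddr, 16) &&
	       a->iif == b->iif;
}
```

For global unicast addresses the device is deliberately left out of the real key, since such packets may legitimately arrive over multiple interfaces.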


[PATCH 07/13] net: tcp_memcontrol: sanitize tcp memory accounting callbacks

2015-11-24 Thread Johannes Weiner
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things
unnecessarily. Replace this with simple and clear charge and uncharge
calls--hidden behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle
the global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit,
and thus close this loophole.
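In miniature, the separate per-memcg and global checks look like the following sketch; the names and structures are illustrative, not the kernel's actual charge path:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two-level scheme described above: the per-memcg
 * counter is checked against the per-memcg limit, and the global
 * counter against the global limit, independently. */
struct counter { long used, limit; };

static bool try_charge(struct counter *c, long pages)
{
	if (c->used + pages > c->limit)
		return false;
	c->used += pages;
	return true;
}

/* A charge succeeds only if it fits both levels; on global failure
 * the memcg-level charge is rolled back. */
static bool charge_skmem(struct counter *memcg, struct counter *global,
			 long pages)
{
	if (!try_charge(memcg, pages))
		return false;
	if (!try_charge(global, pages)) {
		memcg->used -= pages;	/* roll back the memcg charge */
		return false;
	}
	return true;
}
```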

Without a soft limit, the per-memcg memory pressure state in sockets
is generally questionable. However, we did it until now, so we
continue to enter it when the hard limit is hit, and packets are
dropped, to let other sockets in the cgroup know that they shouldn't
grow their transmit windows, either. However, keep it simple in the
new callback model and leave memory pressure lazily when the next
packet is accepted (as opposed to doing it synchroneously when packets
are processed). When packets are dropped, network performance will
already be in the toilet, so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner 
Reviewed-by: Vladimir Davydov 
---
 include/linux/memcontrol.h | 12 -
 include/net/sock.h | 64 ++
 include/net/tcp.h  |  5 ++--
 mm/memcontrol.c| 32 +++
 net/core/sock.c| 26 +++
 net/ipv4/tcp_output.c  |  7 +++--
 6 files changed, 70 insertions(+), 76 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a658be..4d80021 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -664,12 +664,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum 
vm_event_item idx)
 }
 #endif /* CONFIG_MEMCG */
 
-enum {
-   UNDER_LIMIT,
-   SOFT_LIMIT,
-   OVER_LIMIT,
-};
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
@@ -699,6 +693,12 @@ static inline void mem_cgroup_wb_stats(struct 
bdi_writeback *wb,
 struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
+{
+   return proto->memory_pressure;
+}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 0b333c2..888aa3f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1126,8 +1126,9 @@ static inline bool sk_under_memory_pressure(const struct 
sock *sk)
if (!sk->sk_prot->memory_pressure)
return false;
 
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   return !!sk->sk_cgrp->memory_pressure;
+   if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+   mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+   return true;
 
return !!*sk->sk_prot->memory_pressure;
 }
@@ -1141,9 +1142,6 @@ static inline void sk_leave_memory_pressure(struct sock 
*sk)
 
if (*memory_pressure)
*memory_pressure = 0;
-
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   sk->sk_cgrp->memory_pressure = 0;
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
@@ -1151,76 +1149,30 @@ static inline void sk_enter_memory_pressure(struct sock 
*sk)
if (!sk->sk_prot->enter_memory_pressure)
return;
 
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   sk->sk_cgrp->memory_pressure = 1;
-
sk->sk_prot->enter_memory_pressure(sk);
 }
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-   long limit = sk->sk_prot->sysctl_mem[index];
-
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   limit = min_t(long, limit, sk->sk_cgrp->memory_allocated.limit);
-
-   return limit;
-}
-
-static inline void memcg_memory_allocated_add(struct cg_proto *prot,
- unsigned long amt,
- int *parent_status)
-{
-   struct 

[PATCH 10/13] mm: memcontrol: do not account memory+swap on unified hierarchy

2015-11-24 Thread Johannes Weiner
The unified hierarchy memory controller doesn't expose the memory+swap
counter to userspace, but its accounting is hardcoded in all charge
paths right now, including the per-cpu charge cache ("the stock").

To avoid adding yet more pointless memory+swap accounting with the
socket memory support in unified hierarchy, disable the counter
altogether when in unified hierarchy mode.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Reviewed-by: Vladimir Davydov 
---
 mm/memcontrol.c | 44 +---
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0602bee..6b8c0f7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,12 @@ int do_swap_account __read_mostly;
 #define do_swap_account0
 #endif
 
+/* Whether legacy memory+swap accounting is active */
+static bool do_memsw_account(void)
+{
+   return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
+}
+
 static const char * const mem_cgroup_stat_names[] = {
"cache",
"rss",
@@ -1177,7 +1183,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup 
*memcg)
if (count < limit)
margin = limit - count;
 
-   if (do_swap_account) {
+   if (do_memsw_account()) {
count = page_counter_read(&memcg->memsw);
limit = READ_ONCE(memcg->memsw.limit);
if (count <= limit)
@@ -1280,7 +1286,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 
struct task_struct *p)
pr_cont(":");
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-   if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+   if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
continue;
pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
K(mem_cgroup_read_stat(iter, i)));
@@ -1903,7 +1909,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 
if (stock->nr_pages) {
page_counter_uncharge(&old->memory, stock->nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&old->memsw, stock->nr_pages);
css_put_many(&old->css, stock->nr_pages);
stock->nr_pages = 0;
@@ -2033,11 +2039,11 @@ retry:
if (consume_stock(memcg, nr_pages))
return 0;
 
-   if (!do_swap_account ||
+   if (!do_memsw_account() ||
page_counter_try_charge(&memcg->memsw, batch, &counter)) {
if (page_counter_try_charge(&memcg->memory, batch, &counter))
goto done_restock;
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, batch);
mem_over_limit = mem_cgroup_from_counter(counter, memory);
} else {
@@ -2124,7 +2130,7 @@ force:
 * temporarily by force charging it.
 */
page_counter_charge(&memcg->memory, nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
css_get_many(&memcg->css, nr_pages);
 
@@ -2161,7 +2167,7 @@ static void cancel_charge(struct mem_cgroup *memcg, 
unsigned int nr_pages)
return;
 
page_counter_uncharge(&memcg->memory, nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);

css_put_many(&memcg->css, nr_pages);
@@ -2447,7 +2453,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
page_counter_uncharge(&memcg->kmem, nr_pages);
page_counter_uncharge(&memcg->memory, nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
 
page->mem_cgroup = NULL;
@@ -3160,7 +3166,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-   if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+   if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
continue;
seq_printf(m, "%s %lu\n", mem_cgroup_stat_names[i],
   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
@@ -3182,14 +3188,14 @@ static int memcg_stat_show(struct seq_file *m, void *v)
}
seq_printf(m, "hierarchical_memory_limit %llu\n",
   (u64)memory * PAGE_SIZE);
-   if (do_swap_account)
+   if (do_memsw_account())
seq_printf(m, "hierarchical_memsw_limit %llu\n",
   (u64)memsw * PAGE_SIZE);
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
unsigned long long val = 0;
 
-   if (i == 

[PATCH 06/13] net: tcp_memcontrol: simplify the per-memcg limit access

2015-11-24 Thread Johannes Weiner
tcp_memcontrol replicates the global sysctl_mem limit array per
cgroup, but it only ever sets these entries to the value of the
memory_allocated page_counter limit. Use the latter directly.

Signed-off-by: Johannes Weiner 
Reviewed-by: Vladimir Davydov 
---
 include/linux/memcontrol.h | 1 -
 include/net/sock.h | 8 +---
 net/ipv4/tcp_memcontrol.c  | 8 
 3 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cc45407..1a658be 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -89,7 +89,6 @@ struct cg_proto {
struct page_counter memory_allocated;   /* Current allocated 
memory. */
int memory_pressure;
boolactive;
-   longsysctl_mem[3];
/*
 * memcg field is used to find which memcg we belong directly
 * Each memcg struct can hold more than one cg_proto, so container_of
diff --git a/include/net/sock.h b/include/net/sock.h
index 7afbdab..0b333c2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1159,10 +1159,12 @@ static inline void sk_enter_memory_pressure(struct sock 
*sk)
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-   long *prot = sk->sk_prot->sysctl_mem;
+   long limit = sk->sk_prot->sysctl_mem[index];
+
if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   prot = sk->sk_cgrp->sysctl_mem;
-   return prot[index];
+   limit = min_t(long, limit, sk->sk_cgrp->memory_allocated.limit);
+
+   return limit;
 }
 
 static inline void memcg_memory_allocated_add(struct cg_proto *prot,
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 6759e0d..ef4268d 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -21,9 +21,6 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct 
cgroup_subsys *ss)
if (!cg_proto)
return 0;
 
-   cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
-   cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
-   cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
cg_proto->memory_pressure = 0;
cg_proto->memcg = memcg;
 
@@ -54,7 +51,6 @@ EXPORT_SYMBOL(tcp_destroy_cgroup);
 static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 {
struct cg_proto *cg_proto;
-   int i;
int ret;
 
cg_proto = tcp_prot.proto_cgroup(memcg);
@@ -65,10 +61,6 @@ static int tcp_update_limit(struct mem_cgroup *memcg, 
unsigned long nr_pages)
if (ret)
return ret;
 
-   for (i = 0; i < 3; i++)
-   cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
-   sysctl_tcp_mem[i]);
-
if (!cg_proto->active) {
/*
 * The active flag needs to be written after the static_key
-- 
2.6.2



Re: [PATCH net] ipv6: distinguish frag queues by device for multicast and link-local packets

2015-11-24 Thread Hannes Frederic Sowa
On Tue, Nov 24, 2015, at 22:46, David Miller wrote:
> From: Michal Kubecek 
> Date: Tue, 24 Nov 2015 15:07:11 +0100 (CET)
> 
> > If a fragmented multicast packet is received on an ethernet device which
> > has an active macvlan on top of it, each fragment is duplicated and
> > received both on the underlying device and the macvlan. If some
> > fragments for macvlan are processed before the whole packet for the
> > underlying device is reassembled, the "overlapping fragments" test in
> > ip6_frag_queue() discards the whole fragment queue.
> > 
> > To resolve this, add device ifindex to the search key and require it to
> > match reassembling multicast packets and packets to link-local
> > addresses.
> > 
> > Note: similar patch has been already submitted by Yoshifuji Hideaki in
> > 
> >   http://patchwork.ozlabs.org/patch/220979/
> > 
> > but got lost and forgotten for some reason.
> > 
> > Signed-off-by: Michal Kubecek 
> 
> This is definitely the right thing to do and matches how ipv4 keys
> fragments.
> 
> Applied and queued up for -stable, thanks!

I reviewed it earlier and agreed at the time that this patch is necessary.
Unfortunately I forgot to ack it before. :(

Acked-by: Hannes Frederic Sowa 

For IPv4 as for IPv6 global addresses we have to expect packets coming
in over multiple interfaces, so keying on the device is only correct for
link-local and multicast scoped addresses. In IPv4 we don't really key
on the device index, except in the case of a VRF interface.

Thanks,
Hannes


[PATCH 01/13] mm: memcontrol: export root_mem_cgroup

2015-11-24 Thread Johannes Weiner
A later patch will need this symbol in files other than memcontrol.c,
so export it now and replace mem_cgroup_root_css at the same time.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David S. Miller 
Reviewed-by: Vladimir Davydov 
---
 include/linux/memcontrol.h | 3 ++-
 mm/backing-dev.c   | 2 +-
 mm/memcontrol.c| 5 ++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9d5472b..320b690 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -265,7 +265,8 @@ struct mem_cgroup {
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
 };
-extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
+extern struct mem_cgroup *root_mem_cgroup;
 
 /**
  * mem_cgroup_events - count memory events against a cgroup
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9160853..fdc6f4d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -707,7 +707,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
ret = wb_init(>wb, bdi, 1, GFP_KERNEL);
if (!ret) {
-   bdi->wb.memcg_css = mem_cgroup_root_css;
+   bdi->wb.memcg_css = &root_mem_cgroup->css;
bdi->wb.blkcg_css = blkcg_root_css;
}
return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79a29d5..f6ea649 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -76,9 +76,9 @@
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
+struct mem_cgroup *root_mem_cgroup __read_mostly;
+
 #define MEM_CGROUP_RECLAIM_RETRIES 5
-static struct mem_cgroup *root_mem_cgroup __read_mostly;
-struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4217,7 +4217,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state 
*parent_css)
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
-   mem_cgroup_root_css = &memcg->css;
page_counter_init(&memcg->memory, NULL);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
-- 
2.6.2



[PATCH 13/13] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-24 Thread Johannes Weiner
Let the networking stack know when a memcg is under reclaim pressure
so that it can clamp its transmit windows accordingly.

Whenever the reclaim efficiency of a cgroup's LRU lists drops low
enough for a MEDIUM or HIGH vmpressure event to occur, assert a
pressure state in the socket and tcp memory code that tells it to curb
consumption growth from sockets associated with said control group.

Traditionally, vmpressure reports for the entire subtree of a memcg
under pressure, which drops useful information on the individual
groups reclaimed. However, it's too late to change the user interface,
so add a second reporting mode that reports on the level of reclaim
instead of at the level of pressure, and use that report for sockets.

vmpressure events are naturally edge triggered, so for hysteresis
assert socket pressure for a second to allow for subsequent vmpressure
events to occur before letting the socket code return to normal.

This will likely need fine-tuning for a wider variety of workloads, but
for now stick to the vmpressure presets and keep hysteresis simple.
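The hysteresis can be modeled in a few lines; "jiffies", HZ and the field name below stand in for the kernel's, and the real code uses the wrap-safe time_before() rather than a plain compare:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the one-second pressure latch described above.  A
 * vmpressure event stamps a deadline; the pressure predicate stays
 * true until the clock passes it, turning an edge-triggered event
 * into a short level-triggered state. */
#define HZ 100
static unsigned long jiffies;		/* stand-in for the tick counter */
static unsigned long socket_pressure;	/* deadline, in jiffies */

static void vmpressure_event(void)
{
	socket_pressure = jiffies + HZ;	/* assert pressure for ~1 second */
}

static bool under_socket_pressure(void)
{
	return jiffies < socket_pressure;
}
```

Subsequent vmpressure events simply push the deadline forward, which is what keeps the socket code clamped while reclaim pressure persists.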

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 32 ---
 include/linux/vmpressure.h |  5 ++-
 mm/memcontrol.c| 17 ++
 mm/vmpressure.c| 78 +++---
 mm/vmscan.c| 10 +-
 5 files changed, 103 insertions(+), 39 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fae0aaf..a8df46c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -249,6 +249,10 @@ struct mem_cgroup {
struct wb_domain cgwb_domain;
 #endif
 
+#ifdef CONFIG_INET
+   unsigned long   socket_pressure;
+#endif
+
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
@@ -292,18 +296,34 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, 
struct zone *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+#define mem_cgroup_from_counter(counter, member)   \
+   container_of(counter, struct mem_cgroup, member)
+
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
   struct mem_cgroup *,
   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+/**
+ * parent_mem_cgroup - find the accounting parent of a memcg
+ * @memcg: memcg whose parent to find
+ *
+ * Returns the parent memcg, or NULL if this is the root or the memory
+ * controller is in legacy no-hierarchy mode.
+ */
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+   if (!memcg->memory.parent)
+   return NULL;
+   return mem_cgroup_from_counter(memcg->memory.parent, memory);
+}
+
 static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
  struct mem_cgroup *root)
 {
@@ -693,10 +713,14 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, 
unsigned int nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
 #ifdef CONFIG_MEMCG_KMEM
-   return memcg->tcp_mem.memory_pressure;
-#else
-   return false;
+   if (memcg->tcp_mem.memory_pressure)
+   return true;
 #endif
+   do {
+   if (time_before(jiffies, memcg->socket_pressure))
+   return true;
+   } while ((memcg = parent_mem_cgroup(memcg)));
+   return false;
 }
 #else
 #define mem_cgroup_sockets_enabled 0
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 3e45358..a77b142 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -12,6 +12,9 @@
 struct vmpressure {
unsigned long scanned;
unsigned long reclaimed;
+
+   unsigned long tree_scanned;
+   unsigned long tree_reclaimed;
/* The lock is used to keep the scanned/reclaimed above in sync. */
struct spinlock sr_lock;
 
@@ -26,7 +29,7 @@ struct vmpressure {
 struct mem_cgroup;
 
 #ifdef CONFIG_MEMCG
-extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
   unsigned long scanned, unsigned long reclaimed);
 extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 59555b0..a0da91f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1091,9 +1091,6 @@ bool task_in_mem_cgroup(struct task_struct *task, struct 
mem_cgroup *memcg)

Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Alexei Starovoitov
On Tue, Nov 24, 2015 at 08:16:25PM +0100, Hannes Frederic Sowa wrote:
> Hello,
> 
> On Tue, Nov 24, 2015, at 19:59, Alexei Starovoitov wrote:
> > On Tue, Nov 24, 2015 at 07:23:30PM +0100, Hannes Frederic Sowa wrote:
> > > Hello,
> > > 
> > > On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> > > > Its a well-written document, but I don't see how moving the burden of
> > > > locking a single logical tcp connection (to prevent threads from
> > > > reading a partial record) from userspace to kernel is an improvement.
> > > > 
> > > > If you really have 100 threads and must use a single tcp connection
> > > > to multiplex some arbitrarily complex record-format in atomic fashion,
> > > > then your requirements suck.
> > > 
> > > Right, if we are in a datacenter I would probably write a script and use
> > > all those IPv6 addresses to set up mappings a la:
> > > 
> > > for each $cpu; do
> > >   $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
> > > done
> > 
> > interesting idea, but then remote host will be influencing local cpu
> > selection?
> > how remote can figure out the number of local cpus?
> 
> Via rpc! :)
> 
> The configuration shouldn't change all the time and some get_info rpc
> call could provide info for the topology of the machine, or...

Configuration changes all the time. Machines crash, traffic redirected
because of load, etc, etc

> > Consider scenario where you have a ton of tcp sockets feeding into
> > bigger or smaller set of kcm sockets processed by threads or fibers.
> > Pinning sockets to cpu is not going to work.
> > 
> > Also note that opimizing byte copies between kernel and user space is
> > important,
> > but we lose a lot more in user space due to scheduling and re-scheduling
> > when demux-ing user space thread is feeding other worker threads.
> 
> ...also ipvs/netfilter could be used to only inspect the header and
> reroute the packet to some better fitting CPU. Complete hierarchies
> could be build with NUMA and addresses, packets could be rerouted into
> namespaces, etc.

or tc+bpf redirect...
but the reason it won't work is the same reason af_packet+bpf fanout doesn't apply:
It's not packet based demuxing.
Kernel needs to deal with TCP stream first and different messages within single
TCP stream go to different workers.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Increasing skb->mark size

2015-11-24 Thread Florian Westphal
Matt Bennett  wrote:
> I'm emailing this list for feedback on the feasibility of increasing
> skb->mark or adding a new field for marking. Perhaps this extension
> could be done under a new CONFIG option. Perhaps there are other ways we
> could achieve the desired behaviour?

Well I pointed you towards connlabels which provide 128 bit of space
in the conntrack extension area but you did not tell me why you cannot
use it.


Re: use-after-free in sctp_do_sm

2015-11-24 Thread Neil Horman
On Tue, Nov 24, 2015 at 11:10:32AM +0100, Dmitry Vyukov wrote:
> On Tue, Nov 24, 2015 at 10:31 AM, Dmitry Vyukov  wrote:
> > On Tue, Nov 24, 2015 at 10:15 AM, Dmitry Vyukov  wrote:
> >> Hello,
> >>
> >> The following program triggers use-after-free in sctp_do_sm:
> >>
> >> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> >> #include 
> >> #include 
> >> #include 
> >>
> >> int main()
> >> {
> >> long r0 = syscall(SYS_socket, 0xaul, 0x80805ul, 0x0ul, 0, 0, 0);
> >> long r1 = syscall(SYS_mmap, 0x2000ul, 0x1ul, 0x3ul,
> >> 0x32ul, 0xul, 0x0ul);
> >> memcpy((void*)0x20002fe4,
> >> "\x0a\x00\x33\xe7\xeb\x9d\xcf\x61\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xc5\xc8\x88\x64",
> >> 28);
> >> long r3 = syscall(SYS_bind, r0, 0x20002fe4ul, 0x1cul, 0, 0, 0);
> >> memcpy((void*)0x2faa,
> >> "\x9b\x01\x7d\xcd\xb8\x6a\xc7\x3d\x09\x3a\x07\x00\xa7\xc4\xe9\xee\x0a\xd6\xec\xde\x26\x75\x5f\x22\xae\x4e\x33\x00\xb0\x76\x10\x70\xd6\xca\x19\xbc\x15\x83\xcf\x2e\xbc\x99\x0c\x5e\x83\x89\xc1\x44\x9c\x6e\x74\xd8\x5d\x5d\xd0\xf0\xdf\x47\xc0\x00\x71\x0b\x55\x4c\xab\xf0\xd8\x90\xd5\x92\x8c\x6e\x33\x22\x15\x5b\x19\xfb\xed\xdd\xa6\xac\xcb\x60\xcf\xe2\xde\xed\xdb\x95\x5c\xaa\x20\xa3",
> >> 94);
> >> memcpy((void*)0x233a,
> >> "\x02\x00\x33\xe2\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00",
> >> 128);
> >> long r6 = syscall(SYS_sendto, r0, 0x2faaul, 0x5eul,
> >> 0x81ul, 0x233aul, 0x80ul);
> >> return 0;
> >> }
> >>
> >>
> >> ==
> >> BUG: KASAN: use-after-free in sctp_do_sm+0x42f6/0x4f60 at addr 880036fa80a8
> >> Read of size 4 by task a.out/5664
> >> =
> >> BUG kmalloc-4096 (Tainted: GB  ): kasan: bad access detected
> >> -
> >>
> >> INFO: Allocated in sctp_association_new+0x6f/0x1ea0 age=8 cpu=1 pid=5664
> >> [<  none  >] kmem_cache_alloc_trace+0x1cf/0x220 ./mm/slab.c:3707
> >> [<  none  >] sctp_association_new+0x6f/0x1ea0
> >> [<  none  >] sctp_sendmsg+0x1954/0x28e0
> >> [<  none  >] inet_sendmsg+0x316/0x4f0 ./net/ipv4/af_inet.c:802
> >> [< inline >] __sock_sendmsg_nosec ./net/socket.c:641
> >> [< inline >] __sock_sendmsg ./net/socket.c:651
> >> [<  none  >] sock_sendmsg+0xca/0x110 ./net/socket.c:662
> >> [<  none  >] SYSC_sendto+0x208/0x350 ./net/socket.c:1841
> >> [<  none  >] SyS_sendto+0x40/0x50 ./net/socket.c:1862
> >> [<  none  >] entry_SYSCALL_64_fastpath+0x16/0x7a
> >>
> >> INFO: Freed in sctp_association_put+0x150/0x250 age=14 cpu=1 pid=5664
> >> [<  none  >] kfree+0x199/0x1b0 ./mm/slab.c:1211
> >> [<  none  >] sctp_association_put+0x150/0x250
> >> [<  none  >] sctp_association_free+0x498/0x630
> >> [<  none  >] sctp_do_sm+0xd8b/0x4f60
> >> [<  none  >] sctp_primitive_SHUTDOWN+0xa9/0xd0
> >> [<  none  >] sctp_close+0x616/0x790
> >> [<  none  >] inet_release+0xed/0x1c0 ./net/ipv4/af_inet.c:471
> >> [<  none  >] inet6_release+0x50/0x70 ./net/ipv6/af_inet6.c:416
> >> [< inline >] constant_test_bit ././arch/x86/include/asm/bitops.h:321
> >> [<  none  >] sock_release+0x8d/0x200 ./net/socket.c:601
> >> [<  none  >] sock_close+0x16/0x20 ./net/socket.c:1188
> >> [<  none  >] __fput+0x21d/0x6e0 ./fs/file_table.c:265
> >> [<  none  >] fput+0x15/0x20 ./fs/file_table.c:84
> >> [<  none  >] task_work_run+0x163/0x1f0 ./include/trace/events/rcu.h:20
> >> [< inline >] __list_add ./include/linux/list.h:42
> >> [< inline >] list_add_tail ./include/linux/list.h:76
> >> [< inline >] list_move_tail ./include/linux/list.h:168
> >> [< inline >] reparent_leader ./kernel/exit.c:618
> >> [< inline >] forget_original_parent ./kernel/exit.c:669
> >> [< inline >] exit_notify ./kernel/exit.c:697
> >> [<  none  >] do_exit+0x809/0x2b90 ./kernel/exit.c:878
> >> [<  none  >] do_group_exit+0x108/0x320 ./kernel/exit.c:985
> >>
> >> INFO: Slab 0xeadbea00 objects=7 used=1 fp=0x880036fa8000
> >> flags=0x1004080
> >> INFO: Object 0x880036fa8000 @offset=0 fp=0x880036fad668
> >> CPU: 1 PID: 5664 Comm: a.out Tainted: G  

Re: use-after-free in sctp_do_sm

2015-11-24 Thread Eric Dumazet
On Tue, 2015-11-24 at 15:45 -0500, Neil Horman wrote:
> On Tue, Nov 24, 2015 at 11:10:32AM +0100, Dmitry Vyukov wrote:
> > On Tue, Nov 24, 2015 at 10:31 AM, Dmitry Vyukov  wrote:
> > > On Tue, Nov 24, 2015 at 10:15 AM, Dmitry Vyukov  wrote:
> > >> Hello,
> > >>
> > >> The following program triggers use-after-free in sctp_do_sm:
> > >>
> > >> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> > >> #include 
> > >> #include 
> > >> #include 
> > >>
> > >> int main()
> > >> {
> > >> long r0 = syscall(SYS_socket, 0xaul, 0x80805ul, 0x0ul, 0, 0, 0);
> > >> long r1 = syscall(SYS_mmap, 0x2000ul, 0x1ul, 0x3ul,
> > >> 0x32ul, 0xul, 0x0ul);
> > >> memcpy((void*)0x20002fe4,
> > >> "\x0a\x00\x33\xe7\xeb\x9d\xcf\x61\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xc5\xc8\x88\x64",
> > >> 28);
> > >> long r3 = syscall(SYS_bind, r0, 0x20002fe4ul, 0x1cul, 0, 0, 0);
> > >> memcpy((void*)0x2faa,
> > >> "\x9b\x01\x7d\xcd\xb8\x6a\xc7\x3d\x09\x3a\x07\x00\xa7\xc4\xe9\xee\x0a\xd6\xec\xde\x26\x75\x5f\x22\xae\x4e\x33\x00\xb0\x76\x10\x70\xd6\xca\x19\xbc\x15\x83\xcf\x2e\xbc\x99\x0c\x5e\x83\x89\xc1\x44\x9c\x6e\x74\xd8\x5d\x5d\xd0\xf0\xdf\x47\xc0\x00\x71\x0b\x55\x4c\xab\xf0\xd8\x90\xd5\x92\x8c\x6e\x33\x22\x15\x5b\x19\xfb\xed\xdd\xa6\xac\xcb\x60\xcf\xe2\xde\xed\xdb\x95\x5c\xaa\x20\xa3",
> > >> 94);
> > >> memcpy((void*)0x233a,
> > >> "\x02\x00\x33\xe2\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00",
> > >> 128);
> > >> long r6 = syscall(SYS_sendto, r0, 0x2faaul, 0x5eul,
> > >> 0x81ul, 0x233aul, 0x80ul);
> > >> return 0;
> > >> }
> > >>
> > >>
> > >> ==
> > >> BUG: KASAN: use-after-free in sctp_do_sm+0x42f6/0x4f60 at addr 880036fa80a8
> > >> Read of size 4 by task a.out/5664
> > >> =
> > >> BUG kmalloc-4096 (Tainted: GB  ): kasan: bad access detected
> > >> -
> > >>
> > >> INFO: Allocated in sctp_association_new+0x6f/0x1ea0 age=8 cpu=1 pid=5664
> > >> [<  none  >] kmem_cache_alloc_trace+0x1cf/0x220 ./mm/slab.c:3707
> > >> [<  none  >] sctp_association_new+0x6f/0x1ea0
> > >> [<  none  >] sctp_sendmsg+0x1954/0x28e0
> > >> [<  none  >] inet_sendmsg+0x316/0x4f0 ./net/ipv4/af_inet.c:802
> > >> [< inline >] __sock_sendmsg_nosec ./net/socket.c:641
> > >> [< inline >] __sock_sendmsg ./net/socket.c:651
> > >> [<  none  >] sock_sendmsg+0xca/0x110 ./net/socket.c:662
> > >> [<  none  >] SYSC_sendto+0x208/0x350 ./net/socket.c:1841
> > >> [<  none  >] SyS_sendto+0x40/0x50 ./net/socket.c:1862
> > >> [<  none  >] entry_SYSCALL_64_fastpath+0x16/0x7a
> > >>
> > >> INFO: Freed in sctp_association_put+0x150/0x250 age=14 cpu=1 pid=5664
> > >> [<  none  >] kfree+0x199/0x1b0 ./mm/slab.c:1211
> > >> [<  none  >] sctp_association_put+0x150/0x250
> > >> [<  none  >] sctp_association_free+0x498/0x630
> > >> [<  none  >] sctp_do_sm+0xd8b/0x4f60
> > >> [<  none  >] sctp_primitive_SHUTDOWN+0xa9/0xd0
> > >> [<  none  >] sctp_close+0x616/0x790
> > >> [<  none  >] inet_release+0xed/0x1c0 ./net/ipv4/af_inet.c:471
> > >> [<  none  >] inet6_release+0x50/0x70 ./net/ipv6/af_inet6.c:416
> > >> [< inline >] constant_test_bit ././arch/x86/include/asm/bitops.h:321
> > >> [<  none  >] sock_release+0x8d/0x200 ./net/socket.c:601
> > >> [<  none  >] sock_close+0x16/0x20 ./net/socket.c:1188
> > >> [<  none  >] __fput+0x21d/0x6e0 ./fs/file_table.c:265
> > >> [<  none  >] fput+0x15/0x20 ./fs/file_table.c:84
> > >> [<  none  >] task_work_run+0x163/0x1f0 ./include/trace/events/rcu.h:20
> > >> [< inline >] __list_add ./include/linux/list.h:42
> > >> [< inline >] list_add_tail ./include/linux/list.h:76
> > >> [< inline >] list_move_tail ./include/linux/list.h:168
> > >> [< inline >] reparent_leader ./kernel/exit.c:618
> > >> [< inline >] forget_original_parent ./kernel/exit.c:669
> > >> [< inline >] exit_notify ./kernel/exit.c:697
> > >> [<  none  >] do_exit+0x809/0x2b90 ./kernel/exit.c:878
> > >> [<  none  >] do_group_exit+0x108/0x320 ./kernel/exit.c:985
> > 

[PATCH 02/13] net: tcp_memcontrol: properly detect ancestor socket pressure

2015-11-24 Thread Johannes Weiner
When charging socket memory, the code currently checks only the local
page counter for excess to determine whether the memcg is under socket
pressure. But even if the local counter is fine, one of the ancestors
could have breached its limit, which should also force this child to
enter socket pressure. This currently doesn't happen.

Fix this by using page_counter_try_charge() first. If that fails, it
means that either the local counter or one of the ancestors are in
excess of their limit, and the child should enter socket pressure.

Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Johannes Weiner 
Acked-by: David S. Miller 
Reviewed-by: Vladimir Davydov 
---
 include/net/sock.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7f89e4b..8133c71 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1190,11 +1190,13 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
  unsigned long amt,
  int *parent_status)
 {
-   page_counter_charge(&prot->memory_allocated, amt);
+   struct page_counter *counter;
+
+   if (page_counter_try_charge(&prot->memory_allocated, amt, &counter))
+   return;
 
-   if (page_counter_read(&prot->memory_allocated) >
-   prot->memory_allocated.limit)
-   *parent_status = OVER_LIMIT;
+   page_counter_charge(&prot->memory_allocated, amt);
+   *parent_status = OVER_LIMIT;
 }
 
 static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
-- 
2.6.2



[PATCH 12/13] mm: memcontrol: account socket memory in unified hierarchy memory controller

2015-11-24 Thread Johannes Weiner
Socket memory can be a significant share of overall memory consumed by
common workloads. In order to provide reasonable resource isolation in
the unified hierarchy, this type of memory needs to be included in the
tracking/accounting of a cgroup under active memory resource control.

Overhead is only incurred when a non-root control group is created AND
the memory controller is instructed to track and account the memory
footprint of that group. cgroup.memory=nosocket can be specified on
the boot commandline to override any runtime configuration and
forcibly exclude socket memory from active memory resource control.

Signed-off-by: Johannes Weiner 
---
 Documentation/kernel-parameters.txt |   4 ++
 include/linux/memcontrol.h  |  11 +++-
 mm/memcontrol.c | 122 +---
 3 files changed, 111 insertions(+), 26 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 742f69d..7868f1b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -599,6 +599,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
cut the overhead, others just disable the usage. So
only cgroup_disable=memory is actually worthy}
 
+   cgroup.memory=  [KNL] Pass options to the cgroup memory controller.
+   Format: 
+   nosocket -- Disable socket memory accounting.
+
checkreqprot[SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dad56ef..fae0aaf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -170,6 +170,9 @@ struct mem_cgroup {
unsigned long low;
unsigned long high;
 
+   /* Range enforcement for interrupt charges */
+   struct work_struct high_work;
+
unsigned long soft_limit;
 
/* vmpressure notifications */
@@ -679,7 +682,7 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback 
*wb,
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 struct sock;
 extern struct static_key memcg_sockets_enabled_key;
#define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
@@ -689,11 +692,15 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
+#ifdef CONFIG_MEMCG_KMEM
return memcg->tcp_mem.memory_pressure;
+#else
+   return false;
+#endif
 }
 #else
 #define mem_cgroup_sockets_enabled 0
-#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_INET */
 
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed030b5..59555b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -80,6 +80,9 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 #define MEM_CGROUP_RECLAIM_RETRIES 5
 
+/* Socket memory accounting disabled? */
+static bool cgroup_memory_nosocket;
+
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
 int do_swap_account __read_mostly;
@@ -1923,6 +1926,26 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
return NOTIFY_OK;
 }
 
+static void reclaim_high(struct mem_cgroup *memcg,
+unsigned int nr_pages,
+gfp_t gfp_mask)
+{
+   do {
+   if (page_counter_read(&memcg->memory) <= memcg->high)
+   continue;
+   mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+   } while ((memcg = parent_mem_cgroup(memcg)));
+}
+
+static void high_work_func(struct work_struct *work)
+{
+   struct mem_cgroup *memcg;
+
+   memcg = container_of(work, struct mem_cgroup, high_work);
+   reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
+}
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
@@ -1930,20 +1953,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 void mem_cgroup_handle_over_high(void)
 {
unsigned int nr_pages = current->memcg_nr_pages_over_high;
-   struct mem_cgroup *memcg, *pos;
+   struct mem_cgroup *memcg;
 
if (likely(!nr_pages))
return;
 
-   pos = memcg = get_mem_cgroup_from_mm(current->mm);
-
-   do {
-   if (page_counter_read(&pos->memory) <= pos->high)
-   continue;
-   mem_cgroup_events(pos, MEMCG_HIGH, 1);
-   

Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
Tom Herbert  wrote:
> Message size limits can be enforced in BPF or we could add a limit
> enforced by KCM. For instance, the message size limit in http/2 is
> 16M. If it's needed, it wouldn't be much trouble to add a streaming
> interface for large messages.

That still won't change the fact that KCM allows eating large
amounts of kernel memory (you could just open a lot of sockets...).

For tcp we cannot exceed the total rmem limits, even if I can open
4k sockets.

Why anyone would invest such a huge amount of work in making this
kernel-based framing for single-stream tcp record (de)mux rather than
improving the userspace protocol to use UDP or SCTP or at least
one tcp connection per worker is beyond me.

For TX side, why is writev not good enough?
Is KCM tx just so that userspace doesn't need to handle partial writes?


Re: net: Generalise wq_has_sleeper helper

2015-11-24 Thread David Miller
From: Herbert Xu 
Date: Tue, 24 Nov 2015 13:54:23 +0800

> On Wed, Nov 11, 2015 at 05:48:29PM +0800, Herbert Xu wrote:
>>
>> BTW, the networking folks found this years ago and even added
>> helpers to deal with this.  See for example wq_has_sleeper in
>> include/net/sock.h.  It would be good if we can move some of
>> those helpers into wait.h instead.
> 
> Here is a patch against net-next which makes the wq_has_sleeper
> helper available to non-next users:
> 
> ---8<---
> The memory barrier in the helper wq_has_sleeper is needed by just
> about every user of waitqueue_active.  This patch generalises it
> by making it take a wait_queue_head_t directly.  The existing
> helper is renamed to skwq_has_sleeper.
> 
> Signed-off-by: Herbert Xu 

I'm fine with wherever this patch goes.  Herbert is there any
particular tree where it'll facilitate another user quickest?

Or should I just toss it into net-next?

Acked-by: David S. Miller 


[PATCH net] bpf: fix clearing on persistent program array maps

2015-11-24 Thread Daniel Borkmann
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.

Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.

Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of user references and
postpone this flush until the last of them is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.

Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.

Joint work with Alexei.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h   |  5 -
 kernel/bpf/inode.c|  6 +++---
 kernel/bpf/syscall.c  | 36 +---
 kernel/bpf/verifier.c |  3 +--
 4 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index de464e6..83d1926 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -40,6 +40,7 @@ struct bpf_map {
struct user_struct *user;
const struct bpf_map_ops *ops;
struct work_struct work;
+   atomic_t usercnt;
 };
 
 struct bpf_map_type_list {
@@ -167,8 +168,10 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
 void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_put_rcu(struct bpf_prog *prog);
 
-struct bpf_map *bpf_map_get(u32 ufd);
+struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
+void bpf_map_inc(struct bpf_map *map, bool uref);
+void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index be6d726..5a8a797 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -34,7 +34,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
atomic_inc(&((struct bpf_prog *)raw)->aux->refcnt);
break;
case BPF_TYPE_MAP:
-   atomic_inc(&((struct bpf_map *)raw)->refcnt);
+   bpf_map_inc(raw, true);
break;
default:
WARN_ON_ONCE(1);
@@ -51,7 +51,7 @@ static void bpf_any_put(void *raw, enum bpf_type type)
bpf_prog_put(raw);
break;
case BPF_TYPE_MAP:
-   bpf_map_put(raw);
+   bpf_map_put_with_uref(raw);
break;
default:
WARN_ON_ONCE(1);
@@ -64,7 +64,7 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
void *raw;
 
*type = BPF_TYPE_MAP;
-   raw = bpf_map_get(ufd);
+   raw = bpf_map_get_with_uref(ufd);
if (IS_ERR(raw)) {
*type = BPF_TYPE_PROG;
raw = bpf_prog_get(ufd);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0d3313d..4a8f3c1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -82,6 +82,14 @@ static void bpf_map_free_deferred(struct work_struct *work)
map->ops->map_free(map);
 }
 
+static void 

Re: use-after-free in sctp_do_sm

2015-11-24 Thread David Miller
From: Neil Horman 
Date: Tue, 24 Nov 2015 15:45:54 -0500

>> The right commit is:
>> 
>> commit 7d267278a9ece963d77eefec61630223fce08c6c
>> Author: Rainer Weikusat
>> Date:   Fri Nov 20 22:07:23 2015 +
>> unix: avoid use-after-free in ep_remove_wait_queue
> This commit doesn't seem to exist

It's in the 'net' tree.  Which hasn't been pulled into 'net-next' for
a few days.


Re: [PATCH net-next] vrf: remove slave queue and private slave struct

2015-11-24 Thread David Miller
From: Nikolay Aleksandrov 
Date: Tue, 24 Nov 2015 14:29:16 +0100

> From: Nikolay Aleksandrov 
> 
> The private slave queue and slave struct haven't been used for anything
> and aren't needed, this allows to reduce memory usage and simplify
> enslave/release. We can use netdev_for_each_lower_dev() to free the vrf
> ports when deleting a vrf device. Also if in the future a private struct
> is needed for each slave, it can be implemented via lower devices'
> private member (similar to how bonding does it).
> 
> Signed-off-by: Nikolay Aleksandrov 

Applied, thank you.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Tom Herbert
On Tue, Nov 24, 2015 at 12:55 PM, Florian Westphal  wrote:
> Tom Herbert  wrote:
>> Message size limits can be enforced in BPF or we could add a limit
>> enforced by KCM. For instance, the message size limit in http/2 is
>> 16M. If it's needed, it wouldn't be much trouble to add a streaming
>> interface for large messages.
>
> That still won't change the fact that KCM allows eating large
> amounts of kernel memory (you could just open a lot of sockets...).
>
> For tcp we cannot exceed the total rmem limits, even if I can open
> 4k sockets.
>
> Why anyone would invest such a huge amount of work in making this
> kernel-based framing for single-stream tcp record (de)mux rather than
> improving the userspace protocol to use UDP or SCTP or at least
> one tcp connection per worker is beyond me.
>
From the /0 patch:

Q: Why not use an existing message-oriented protocol such as RUDP,
   DCCP, SCTP, RDS, and others?

A: Because that would entail using a completely new transport protocol.
   Deploying a new protocol at scale is either a huge undertaking or
   fundamentally infeasible. This is true both in the Internet and in
   the data center, due in large part to protocol ossification.
   Besides, we want KCM to work with existing, well-deployed application
   protocols that we couldn't change even if we wanted to (e.g. http/2).

   KCM simply defines a new interface method, it does not redefine any
   aspect of the transport protocol nor application protocol, nor set
   any new requirements on these. Neither does KCM attempt to implement
   any application protocol logic other than message delineation in the
   stream. These are fundamental requirements of KCM.


> For TX side, why is writev not good enough?

writev on a TCP stream does not guarantee atomicity of the operation.

> Is KCM tx just so that userspace doesn't need to handle partial writes?

It writes atomically, without user space needing to implement locking
when a socket is shared amongst threads.


[PATCH 03/13] net: tcp_memcontrol: remove bogus hierarchy pressure propagation

2015-11-24 Thread Johannes Weiner
When a cgroup currently breaches its socket memory limit, it enters
memory pressure mode for itself and its *ancestors*. This throttles
transmission in unrelated sibling and cousin subtrees that have
nothing to do with the breached limit.

On the contrary, breaching a limit should make that group and its
*children* enter memory pressure mode. But this happens already,
albeit lazily: if an ancestor limit is breached, siblings will enter
memory pressure on their own once the next packet arrives for them.

So no additional hierarchy code is needed. Remove the bogus stuff.

Signed-off-by: Johannes Weiner 
Acked-by: David S. Miller 
Reviewed-by: Vladimir Davydov 
---
 include/net/sock.h | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 8133c71..e27a8bb 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1152,14 +1152,8 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
if (*memory_pressure)
*memory_pressure = 0;
 
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-   struct cg_proto *cg_proto = sk->sk_cgrp;
-   struct proto *prot = sk->sk_prot;
-
-   for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-   cg_proto->memory_pressure = 0;
-   }
-
+   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+   sk->sk_cgrp->memory_pressure = 0;
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
@@ -1167,13 +1161,8 @@ static inline void sk_enter_memory_pressure(struct sock *sk)
if (!sk->sk_prot->enter_memory_pressure)
return;
 
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-   struct cg_proto *cg_proto = sk->sk_cgrp;
-   struct proto *prot = sk->sk_prot;
-
-   for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-   cg_proto->memory_pressure = 1;
-   }
+   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+   sk->sk_cgrp->memory_pressure = 1;
 
sk->sk_prot->enter_memory_pressure(sk);
 }
-- 
2.6.2



[PATCH 00/13] mm: memcontrol: account socket memory in unified hierarchy v4

2015-11-24 Thread Johannes Weiner
Hi,

this is version 4 of the patches to add socket memory accounting to
the unified hierarchy memory controller.

Andrew, absent any new showstoppers, please consider merging this
series for v4.5. Thanks!

Changes since v3 include:

- Restored the local vmpressure reporting while preserving the
  hierarchical pressure semantics of the user interface, such that
  networking is throttled also for global memory shortage, and not
  just when encountering configured cgroup limits. As per Vladimir,
  this will make fully provisioned systems perform more smoothly.

- Make packet submission paths enter direct reclaim when memory is
  tight, and reserve the background balancing worklet for receiving
  packets in softirq context.

- Dropped a buggy shrinker cleanup, spotted by Vladimir.

- Fixed a missing return statement, spotted by Eric.

- Documented cgroup.memory=nosocket, as per Michal.

- Rebased onto latest mmots and added ack tags.

Changes since v2 include:

- Fixed an underflow bug in the mem+swap counter that came through the
  design of the per-cpu charge cache. To fix that, the unused mem+swap
  counter is now fully patched out on unified hierarchy. Double whammy.

- Restored the counting jump label such that the networking callbacks
  get patched out again when the last memory-controlled cgroup goes
  away. The code was already there, so we might as well keep it.

- Broke down the massive tcp_memcontrol rewrite patch into smaller
  logical pieces to (hopefully) make it easier to review and verify.

Changes since v1 include:

- No accounting overhead unless a dedicated cgroup is created and the
  memory controller instructed to track that group's memory footprint.
  Distribution kernels enable CONFIG_MEMCG, and users (incl. systemd)
  might create cgroups only for process control or resources other
  than memory. As noted by David and Michal, these setups shouldn't
  pay any overhead for this.

- Continue to enter the socket pressure state when hitting the memory
  controller's hard limit. Vladimir noted that there is at least some
  value in telling other sockets in the cgroup to not increase their
  transmit windows when one of them is already dropping packets.

- Drop the controversial vmpressure rework. Instead of changing the
  level where pressure is noted, keep noting pressure in its origin
  and then make the pressure check hierarchical. As noted by Michal
  and Vladimir, we shouldn't risk changing user-visible behavior.

---

Socket buffer memory can make up a significant share of a workload's
memory footprint that can be directly linked to userspace activity,
and so it needs to be part of the memory controller to provide proper
resource isolation/containment.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group or not. Likewise, struggling anonymous or cache
workingsets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real life applications.

To not repeat this mistake, the new memory controller will account all
types of memory pages it is tracking on behalf of a cgroup in a single
pool. Upon pressure, the VM reclaims and shrinks and puts pressure on
whatever memory consumer in that pool is within its reach.

For socket memory, pressure feedback is provided through vmpressure
events. When the VM has trouble freeing memory, the network code is
instructed to stop growing the cgroup's transmit windows.

This series begins with a rework of the existing tcp memory controller
that simplifies and cleans up the code while allowing us to have only
one set of networking hooks for both memory controller versions. The
original behavior of the existing tcp controller should be preserved.

It then adds socket accounting to the v2 memory controller, including
the use of the per-cpu charge cache and async memory.high enforcement
from socket memory charges.

Lastly, vmpressure is hooked up to the socket code so that it stops
growing transmit windows when the VM has trouble reclaiming memory.

 Documentation/kernel-parameters.txt |   4 +
 include/linux/memcontrol.h  |  71 
 include/linux/vmpressure.h  |   5 +-
 include/net/sock.h  | 149 ++---
 include/net/tcp.h   |   5 +-
 include/net/tcp_memcontrol.h|   1 -
 mm/backing-dev.c|   2 +-
 mm/memcontrol.c | 296 ++
 mm/vmpressure.c |  78 ++---
 mm/vmscan.c |  10 +-
 net/core/sock.c |  78 +++--
 net/ipv4/tcp.c  |   3 +-
 net/ipv4/tcp_ipv4.c |   9 +-
 

[PATCH 05/13] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets

2015-11-24 Thread Johannes Weiner
The number of allocated sockets is used for calculations in the soft
limit phase, where packets are accepted but the socket is under memory
pressure. Since there is no soft limit phase in tcp_memcontrol, and
memory pressure is only entered when packets are already dropped, this
is actually dead code. Remove it.

As this is the last user of parent_cg_proto(), remove that too.

Signed-off-by: Johannes Weiner 
Acked-by: David S. Miller 
Reviewed-by: Vladimir Davydov 
---
 include/linux/memcontrol.h |  1 -
 include/net/sock.h | 39 +++
 net/ipv4/tcp_memcontrol.c  |  3 ---
 3 files changed, 3 insertions(+), 40 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 724b76a..cc45407 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -87,7 +87,6 @@ enum mem_cgroup_events_target {
 
 struct cg_proto {
struct page_counter memory_allocated;   /* Current allocated 
memory. */
-   struct percpu_counter   sockets_allocated;  /* Current number of 
sockets. */
int memory_pressure;
boolactive;
longsysctl_mem[3];
diff --git a/include/net/sock.h b/include/net/sock.h
index e27a8bb..7afbdab 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1095,19 +1095,9 @@ static inline void sk_refcnt_debug_release(const struct 
sock *sk)
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
 extern struct static_key memcg_socket_limit_enabled;
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-  struct cg_proto *cg_proto)
-{
-   return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
-}
 #define mem_cgroup_sockets_enabled 
static_key_false(&memcg_socket_limit_enabled)
 #else
 #define mem_cgroup_sockets_enabled 0
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-  struct cg_proto *cg_proto)
-{
-   return NULL;
-}
 #endif
 
 static inline bool sk_stream_memory_free(const struct sock *sk)
@@ -1233,41 +1223,18 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
 
 static inline void sk_sockets_allocated_dec(struct sock *sk)
 {
-   struct proto *prot = sk->sk_prot;
-
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-   struct cg_proto *cg_proto = sk->sk_cgrp;
-
-   for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-   percpu_counter_dec(&cg_proto->sockets_allocated);
-   }
-
-   percpu_counter_dec(prot->sockets_allocated);
+   percpu_counter_dec(sk->sk_prot->sockets_allocated);
 }
 
 static inline void sk_sockets_allocated_inc(struct sock *sk)
 {
-   struct proto *prot = sk->sk_prot;
-
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-   struct cg_proto *cg_proto = sk->sk_cgrp;
-
-   for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-   percpu_counter_inc(&cg_proto->sockets_allocated);
-   }
-
-   percpu_counter_inc(prot->sockets_allocated);
+   percpu_counter_inc(sk->sk_prot->sockets_allocated);
 }
 
 static inline int
 sk_sockets_allocated_read_positive(struct sock *sk)
 {
-   struct proto *prot = sk->sk_prot;
-
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   return 
percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
-
-   return percpu_counter_read_positive(prot->sockets_allocated);
+   return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);
 }
 
 static inline int
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index d07579a..6759e0d 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -32,7 +32,6 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct 
cgroup_subsys *ss)
counter_parent = &parent_cg->memory_allocated;
 
page_counter_init(&cg_proto->memory_allocated, counter_parent);
-   percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
 
return 0;
 }
@@ -46,8 +45,6 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
if (!cg_proto)
return;
 
-   percpu_counter_destroy(&cg_proto->sockets_allocated);
-
if (cg_proto->active)
static_key_slow_dec(&memcg_socket_limit_enabled);
 
-- 
2.6.2



Re: [PATCH net-next v3 2/2] net: add driver for Netronome NFP4000/NFP6000 NIC VFs

2015-11-24 Thread Jakub Kicinski
On Tue, 24 Nov 2015 14:25:31 -0500 (EST), David Miller wrote:
> From: Jakub Kicinski 
> Date: Mon, 23 Nov 2015 11:04:57 +
> 
> > +#ifdef CONFIG_NFP_NET_DEBUG
> > +#define DEBUG
> > +#endif
> 
> Do not design ad-hoc debug logging facilities locally in your driver,
> and instead use the existing tree wide facilities as they were designed
> to be used so that any user can get debugging logs simply by turning it
> on at run time rather than having the change magic config options in
> their kernel.

True.  I picked this habit up from the rt2x00 driver a long time ago.
Now I see nobody else is doing such things...


Increasing skb->mark size

2015-11-24 Thread Matt Bennett
Hello,

Currently we have a number of router features (firewall, QoS, etc)
making use of ip tables and connection tracking. We do this by giving
each feature a certain area of skb->mark (say 8 bits each). This allows
us to simply restore skb->mark (using connection tracking) for packets
in a flow using the logic below.

Our software logic is:

1. The first packet in a flow traverses through ip-tables where
each feature has set rules to mark their section of skb->mark. 

2. We then store the mark into connmark. 

3. Then as each packet in the flow hits ip tables the first rule in
ip-tables simply restores the connmark and the packet goes to egress.

Up until now this has worked very well for us. However since skb->mark
is only 32 bits we have quickly run out of bits for marking. 

This leaves us with two options:
 - Don't allow all features to be enabled at once (i.e. multiple
features share the same area of skb->mark). This is not ideal.

 - Increase the size of skb->mark (or another solution such as adding an
additional field into sk_buff for marking).

Hopefully what I have explained above is a strong example of where
skb->mark is no longer large enough on routers using connection tracking
to achieve superior performance. 

I'm emailing this list for feedback on the feasibility of increasing
skb->mark or adding a new field for marking. Perhaps this extension
could be done under a new CONFIG option. Perhaps there are other ways we
could achieve the desired behaviour?

Thanks,
Matt


Re: [PATCH] net: fec: no need to test for the return type of of_property_read_u32

2015-11-24 Thread David Miller
From: Saurabh Sengar 
Date: Mon, 23 Nov 2015 19:21:48 +0530

> in case of error no need to set num_tx and num_rx = 1, because in case of 
> error
> these variables will remain unchanged by of_property_read_u32 ie 1 only
> 
> Signed-off-by: Saurabh Sengar 

Applied.


[PATCH v2] Add support for rt_tables.d

2015-11-24 Thread David Ahern
Add support for reading table id/name mappings from rt_tables.d
directory.

Suggested-by: Roopa Prabhu 
Signed-off-by: David Ahern 
---
v2
- comments from Stephen
  - only process files ending in '.conf'
  - add README file to etc/iproute2/rt_tables.d

 etc/iproute2/rt_tables.d/README |  3 +++
 lib/rt_names.c  | 26 ++
 2 files changed, 29 insertions(+)
 create mode 100644 etc/iproute2/rt_tables.d/README

diff --git a/etc/iproute2/rt_tables.d/README b/etc/iproute2/rt_tables.d/README
new file mode 100644
index ..79386f89cc14
--- /dev/null
+++ b/etc/iproute2/rt_tables.d/README
@@ -0,0 +1,3 @@
+Each file in this directory is an rt_tables configuration file. iproute2
+commands scan this directory processing all files that end in '.conf'.
+
diff --git a/lib/rt_names.c b/lib/rt_names.c
index e87c65dad39e..f68e91d6d046 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <dirent.h>
 
 #include 
 #include 
@@ -339,6 +340,8 @@ static int rtnl_rttable_init;
 
 static void rtnl_rttable_initialize(void)
 {
+   struct dirent *de;
+   DIR *d;
int i;
 
rtnl_rttable_init = 1;
@@ -348,6 +351,29 @@ static void rtnl_rttable_initialize(void)
}
rtnl_hash_initialize(CONFDIR "/rt_tables",
 rtnl_rttable_hash, 256);
+
+   d = opendir(CONFDIR "/rt_tables.d");
+   if (!d)
+   return;
+
+   while ((de = readdir(d)) != NULL) {
+   char path[PATH_MAX];
+   size_t len;
+
+   if (*de->d_name == '.')
+   continue;
+
+   /* only consider filenames ending in '.conf' */
+   len = strlen(de->d_name);
+   if (len <= 5)
+   continue;
+   if (strcmp(de->d_name + len - 5, ".conf"))
+   continue;
+
+   snprintf(path, sizeof(path), CONFDIR "/rt_tables.d/%s", 
de->d_name);
+   rtnl_hash_initialize(path, rtnl_rttable_hash, 256);
+   }
+   closedir(d);
 }
 
 const char * rtnl_rttable_n2a(__u32 id, char *buf, int len)
-- 
2.1.4



Re: use-after-free in sock_wake_async

2015-11-24 Thread Al Viro
On Tue, Nov 24, 2015 at 04:30:01PM -0500, Jason Baron wrote:

> So looking at this trace I think its the other->sk_socket that gets
> freed and then we call sk_wake_async() on it.
> 
> We could I think grab the socket reference there with unix_state_lock(),
> since that is held by unix_release_sock() before the final iput() is called.
> 
> So something like below might work (compile tested only):

Ewww...

> +struct socket *unix_peer_get_socket(struct sock *s)
> +{
> + struct socket *peer;
> +
> + unix_state_lock(s);
> + peer = s->sk_socket;
> + if (peer)
> + __iget(SOCK_INODE(s->sk_socket));
> + unix_state_unlock(s);
> +
> + return peer;

>  out_err:
> + if (other_socket)
> + iput(SOCK_INODE(other_socket));
>   scm_destroy(&scm);
>   return sent ? : err;
>  }

Interplay between socket, file and inode lifetimes is already too convoluted,
and this just makes it nastier.  I don't have a ready solution at the moment,
but this one is too ugly to live.

Al, digging through RTFS(net/unix/af_unix.c) right now...


[PATCH 04/13] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label

2015-11-24 Thread Johannes Weiner
Move the jump-label from sock_update_memcg() and sock_release_memcg()
to the callsite, and so eliminate those function calls when socket
accounting is not enabled.

This also eliminates the need for dummy functions because the calls
will be optimized away if the Kconfig options are not enabled.

Signed-off-by: Johannes Weiner 
Acked-by: David S. Miller 
Reviewed-by: Vladimir Davydov 
---
 include/linux/memcontrol.h |  9 +---
 mm/memcontrol.c| 56 +-
 net/core/sock.c|  9 ++--
 net/ipv4/tcp.c |  3 ++-
 net/ipv4/tcp_ipv4.c|  4 +++-
 5 files changed, 33 insertions(+), 48 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 320b690..724b76a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -697,17 +697,10 @@ static inline void mem_cgroup_wb_stats(struct 
bdi_writeback *wb,
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
-struct sock;
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
-#else
-static inline void sock_update_memcg(struct sock *sk)
-{
-}
-static inline void sock_release_memcg(struct sock *sk)
-{
-}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6ea649..0b78f82 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -293,46 +293,40 @@ static inline struct mem_cgroup 
*mem_cgroup_from_id(unsigned short id)
 
 void sock_update_memcg(struct sock *sk)
 {
-   if (mem_cgroup_sockets_enabled) {
-   struct mem_cgroup *memcg;
-   struct cg_proto *cg_proto;
+   struct mem_cgroup *memcg;
+   struct cg_proto *cg_proto;
 
-   BUG_ON(!sk->sk_prot->proto_cgroup);
+   BUG_ON(!sk->sk_prot->proto_cgroup);
 
-   /* Socket cloning can throw us here with sk_cgrp already
-* filled. It won't however, necessarily happen from
-* process context. So the test for root memcg given
-* the current task's memcg won't help us in this case.
-*
-* Respecting the original socket's memcg is a better
-* decision in this case.
-*/
-   if (sk->sk_cgrp) {
-   BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
-   css_get(&sk->sk_cgrp->memcg->css);
-   return;
-   }
+   /* Socket cloning can throw us here with sk_cgrp already
+* filled. It won't however, necessarily happen from
+* process context. So the test for root memcg given
+* the current task's memcg won't help us in this case.
+*
+* Respecting the original socket's memcg is a better
+* decision in this case.
+*/
+   if (sk->sk_cgrp) {
+   BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
+   css_get(&sk->sk_cgrp->memcg->css);
+   return;
+   }
 
-   rcu_read_lock();
-   memcg = mem_cgroup_from_task(current);
-   cg_proto = sk->sk_prot->proto_cgroup(memcg);
-   if (cg_proto && cg_proto->active &&
-   css_tryget_online(&cg_proto->css)) {
-   sk->sk_cgrp = cg_proto;
-   }
-   rcu_read_unlock();
+   rcu_read_lock();
+   memcg = mem_cgroup_from_task(current);
+   cg_proto = sk->sk_prot->proto_cgroup(memcg);
+   if (cg_proto && cg_proto->active &&
+   css_tryget_online(&cg_proto->css)) {
+   sk->sk_cgrp = cg_proto;
}
+   rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
 
 void sock_release_memcg(struct sock *sk)
 {
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-   struct mem_cgroup *memcg;
-   WARN_ON(!sk->sk_cgrp->memcg);
-   memcg = sk->sk_cgrp->memcg;
-   css_put(&sk->sk_cgrp->memcg->css);
-   }
+   WARN_ON(!sk->sk_cgrp->memcg);
+   css_put(&sk->sk_cgrp->memcg->css);
 }
 
 struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 1e4dd54..04e54bc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1488,12 +1488,6 @@ void sk_free(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_free);
 
-static void sk_update_clone(const struct sock *sk, struct sock *newsk)
-{
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   sock_update_memcg(newsk);
-}
-
 /**
  * sk_clone_lock - clone a socket, and lock its clone
  * @sk: the socket to clone
@@ -1589,7 +1583,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
sk_set_socket(newsk, NULL);
newsk->sk_wq = NULL;
 
-   sk_update_clone(sk, newsk);
+   if 

Re: use-after-free in sock_wake_async

2015-11-24 Thread Eric Dumazet
On Tue, Nov 24, 2015 at 1:45 PM, Benjamin LaHaise  wrote:
> On Tue, Nov 24, 2015 at 04:30:01PM -0500, Jason Baron wrote:
>> So looking at this trace I think its the other->sk_socket that gets
>> freed and then we call sk_wake_async() on it.
>>
>> We could I think grab the socket reference there with unix_state_lock(),
>> since that is held by unix_release_sock() before the final iput() is called.
>>
>> So something like below might work (compile tested only):
>
> That just adds the performance regression back in.  It should be possible
> to protect the other socket dereference using RCU.  I haven't had time to
> look at this yet today, but will try to find some time this evening to come
> up with a suggested patch.
>
> -ben
>
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index aaa0b58..2b014f1 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -196,6 +196,19 @@ static inline int unix_recvq_full(struct sock const
>> *sk)
>>   return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
>>  }
>>
>> +struct socket *unix_peer_get_socket(struct sock *s)
>> +{
>> + struct socket *peer;
>> +
>> + unix_state_lock(s);
>> + peer = s->sk_socket;
>> + if (peer)
>> + __iget(SOCK_INODE(s->sk_socket));
>> + unix_state_unlock(s);
>> +
>> + return peer;
>> +}
>> +
>>  struct sock *unix_peer_get(struct sock *s)
>>  {
>>   struct sock *peer;
>> @@ -1639,6 +1652,7 @@ static int unix_stream_sendmsg(struct socket
>> *sock, struct msghdr *msg,
>>  {
>>   struct sock *sk = sock->sk;
>>   struct sock *other = NULL;
>> + struct socket *other_socket = NULL;
>>   int err, size;
>>   struct sk_buff *skb;
>>   int sent = 0;
>> @@ -1662,7 +1676,10 @@ static int unix_stream_sendmsg(struct socket
>> *sock, struct msghdr *msg,
>>   } else {
>>   err = -ENOTCONN;
>>   other = unix_peer(sk);
>> - if (!other)
>> + if (other)
>> + other_socket = unix_peer_get_socket(other);
>> +
>> + if (!other_socket)
>>   goto out_err;
>>   }
>>
>> @@ -1721,6 +1738,9 @@ static int unix_stream_sendmsg(struct socket
>> *sock, struct msghdr *msg,
>>   sent += size;
>>   }
>>
>> + if (other_socket)
>> + iput(SOCK_INODE(other_socket));
>> +
>>   scm_destroy(&scm);
>>
>>   return sent;
>> @@ -1733,6 +1753,8 @@ pipe_err:
>>   send_sig(SIGPIPE, current, 0);
>>   err = -EPIPE;
>>  out_err:
>> + if (other_socket)
>> + iput(SOCK_INODE(other_socket));
>>   scm_destroy(&scm);
>>   return sent ? : err;
>>  }
>
> --
> "Thought is the essence of where you are now."


This might be a data race in sk_wake_async() if inlined by compiler
(see https://lkml.org/lkml/2015/11/24/680 for another example)

KASAN adds register pressure and compiler can then do 'stupid' things :(

diff --git a/include/net/sock.h b/include/net/sock.h
index 7f89e4ba18d1..2af6222ccc67 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2008,7 +2008,7 @@ static inline unsigned long sock_wspace(struct sock *sk)
 static inline void sk_wake_async(struct sock *sk, int how, int band)
 {
if (sock_flag(sk, SOCK_FASYNC))
-   sock_wake_async(sk->sk_socket, how, band);
+   sock_wake_async(READ_ONCE(sk->sk_socket), how, band);
 }

 /* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might


[PATCH net-next 1/3] vhost: introduce vhost_has_work()

2015-11-24 Thread Jason Wang
This patch introduces a helper which gives a hint about whether or not
there is work queued in the work list.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d3f7674..43284ad 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
2.5.0



[PATCH net-next 0/3] basic busy polling support for vhost_net

2015-11-24 Thread Jason Wang
Hi all:

This series tries to add basic busy polling for vhost net. The idea is
simple: at the end of tx/rx processing, busy poll for newly added tx
descriptors and on the rx receive socket for a while. The maximum
amount of time (in us) that may be spent on busy polling is specified
via ioctl.

Test A were done through:

- 50 us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Guest with 1 vcpu and 1 queue

Results:
- For stream workload, ioexits were reduced dramatically in medium
  size (1024-2048) of tx (at most -43%) and almost all rx (at most
  -84%) as a result of polling. This compensates for the possibly
  wasted cpu cycles more or less. That is probably why we can still see
  some increase in the normalized throughput in some cases.
- Throughput of tx was increased (at most 50%) except for the huge
  write (16384). And we can send more packets in that case (+tpkts were
  increased).
- Very minor rx regression in some cases.
- Improvement on TCP_RR (at most 17%).

Guest TX:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/  +18%/  -10%/   +7%/  +11%/0%
   64/ 2/  +14%/  -13%/   +7%/  +10%/0%
   64/ 4/   +8%/  -17%/   +7%/   +9%/0%
   64/ 8/  +11%/  -15%/   +7%/  +10%/0%
  256/ 1/  +35%/   +9%/  +21%/  +12%/  -11%
  256/ 2/  +26%/   +2%/  +20%/   +9%/  -10%
  256/ 4/  +23%/0%/  +21%/  +10%/   -9%
  256/ 8/  +23%/0%/  +21%/   +9%/   -9%
  512/ 1/  +31%/   +9%/  +23%/  +18%/  -12%
  512/ 2/  +30%/   +8%/  +24%/  +15%/  -10%
  512/ 4/  +26%/   +5%/  +24%/  +14%/  -11%
  512/ 8/  +32%/   +9%/  +23%/  +15%/  -11%
 1024/ 1/  +39%/  +16%/  +29%/  +22%/  -26%
 1024/ 2/  +35%/  +14%/  +30%/  +21%/  -22%
 1024/ 4/  +34%/  +13%/  +32%/  +21%/  -25%
 1024/ 8/  +36%/  +14%/  +32%/  +19%/  -26%
 2048/ 1/  +50%/  +27%/  +34%/  +26%/  -42%
 2048/ 2/  +43%/  +21%/  +36%/  +25%/  -43%
 2048/ 4/  +41%/  +20%/  +37%/  +27%/  -43%
 2048/ 8/  +40%/  +18%/  +35%/  +25%/  -42%
16384/ 1/0%/  -12%/   -1%/   +8%/  +15%
16384/ 2/0%/  -10%/   +1%/   +4%/   +5%
16384/ 4/0%/  -11%/   -3%/0%/   +3%
16384/ 8/0%/  -10%/   -4%/0%/   +1%

Guest RX:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/   -2%/  -21%/   +1%/   +2%/  -75%
   64/ 2/   +1%/   -9%/  +12%/0%/  -55%
   64/ 4/0%/   -6%/   +5%/   -1%/  -44%
   64/ 8/   -5%/   -5%/   +7%/  -23%/  -50%
  256/ 1/   -8%/  -18%/  +16%/  +15%/  -63%
  256/ 2/0%/   -8%/   +9%/   -2%/  -26%
  256/ 4/0%/   -7%/   -8%/  +20%/  -41%
  256/ 8/   -8%/  -11%/   -9%/  -24%/  -78%
  512/ 1/   -6%/  -19%/  +20%/  +18%/  -29%
  512/ 2/0%/  -10%/  -14%/   -8%/  -31%
  512/ 4/   -1%/   -5%/  -11%/   -9%/  -38%
  512/ 8/   -7%/   -9%/  -17%/  -22%/  -81%
 1024/ 1/0%/  -16%/  +12%/   +9%/  -11%
 1024/ 2/0%/  -11%/0%/   +3%/  -30%
 1024/ 4/0%/   -4%/   +2%/   +6%/  -15%
 1024/ 8/   -3%/   -4%/   -8%/   -8%/  -70%
 2048/ 1/   -8%/  -23%/  +36%/  +22%/  -11%
 2048/ 2/0%/  -12%/   +1%/   +3%/  -29%
 2048/ 4/0%/   -3%/  -17%/  -15%/  -84%
 2048/ 8/0%/   -3%/   +1%/   -3%/  +10%
16384/ 1/0%/  -11%/   +4%/   +7%/  -22%
16384/ 2/0%/   -7%/   +4%/   +4%/  -33%
16384/ 4/0%/   -2%/   -2%/   -4%/  -23%
16384/ 8/   -1%/   -2%/   +1%/  -22%/  -40%

TCP_RR:
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/  +11%/  -26%/  +11%/  +11%/  +10%
1/25/  +11%/  -15%/  +11%/  +11%/0%
1/50/   +9%/  -16%/  +10%/  +10%/0%
1/   100/   +9%/  -15%/   +9%/   +9%/0%
   64/ 1/  +11%/  -31%/  +11%/  +11%/  +11%
   64/25/  +12%/  -14%/  +12%/  +12%/0%
   64/50/  +11%/  -14%/  +12%/  +12%/0%
   64/   100/  +11%/  -15%/  +11%/  +11%/0%
  256/ 1/  +11%/  -27%/  +11%/  +11%/  +10%
  256/25/  +17%/  -11%/  +16%/  +16%/   -1%
  256/50/  +16%/  -11%/  +17%/  +17%/   +1%
  256/   100/  +17%/  -11%/  +18%/  +18%/   +1%

Test B were done through:

- 50us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Two guests, each with 1 vcpu and 1 queue
- pin two vhost threads to the same cpu on host to simulate the cpu
  contending

Results:
- In this radical case, we can still get at most 14% improvement on
  TCP_RR.
- For guest tx stream, minor improvement with at most 5% regression in
  the one byte case. For guest rx stream, at most 5% regression was seen.

Guest TX:
size /-+%   /
1/-5.55%/
64   /+1.11%/
256  /+2.33%/
512  /-0.03%/
1024 /+1.14%/
4096 /+0.00%/
16384/+0.00%/

Guest RX:
size /-+%   /
1/-5.11%/
64   /-0.55%/
256  /-2.35%/
512  /-3.39%/
1024 /+6.8% /
4096 /-0.01%/
16384/+0.00%/

TCP_RR:
size /-+%/
1/+9.79% /
64   /+4.51% /
256  /+6.47% /
512  /-3.37% /
1024 /+6.15% /
4096 /+14.88%/
16384/-2.23% /

Changes from RFC V3:
- small tweak on the code to avoid 

[PATCH net-next 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-24 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 26 +-
 drivers/vhost/vhost.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 163b365..b86c5aa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
+bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+   __virtio16 avail_idx;
+   int r;
+
+   r = __get_user(avail_idx, &vq->avail->idx);
+   if (r) {
+   vq_err(vq, "Failed to check avail idx at %p: %d\n",
+  &vq->avail->idx, r);
+   return false;
+   }
+
+   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
@@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
 * sure it's written, then check again. */
smp_mb();
-   r = __get_user(avail_idx, &vq->avail->idx);
-   if (r) {
-   vq_err(vq, "Failed to check avail idx at %p: %d\n",
-  &vq->avail->idx, r);
-   return false;
-   }
-
-   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+   return vhost_vq_more_avail(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 43284ad..2f3c57c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct 
vhost_virtqueue *,
   struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
-- 
2.5.0



[PATCH net-next 3/3] vhost_net: basic polling support

2015-11-24 Thread Jason Wang
This patch tries to poll for newly added tx buffers or the socket
receive queue for a while at the end of tx/rx processing. The maximum
time spent on polling is specified through a new kind of vring ioctl.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c| 72 ++
 drivers/vhost/vhost.c  | 15 ++
 drivers/vhost/vhost.h  |  1 +
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..ce6da77 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -287,6 +287,41 @@ static void vhost_zerocopy_callback(struct ubuf_info 
*ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool vhost_can_busy_poll(struct vhost_dev *dev,
+   unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
+   struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   while (vhost_can_busy_poll(vq->dev, endtime) &&
+  !vhost_vq_more_avail(vq->dev, vq))
+   cpu_relax();
+   preempt_enable();
+   }
+
+   return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +366,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -435,6 +469,34 @@ static int peek_head_len(struct sock *sk)
return len;
 }
 
+static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&net->dev, vq);
+
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+
+   while (vhost_can_busy_poll(&net->dev, endtime) &&
+  skb_queue_empty(&sk->sk_receive_queue) &&
+  !vhost_vq_more_avail(&net->dev, vq))
+   cpu_relax();
+
+   preempt_enable();
+
+   if (vhost_enable_notify(&net->dev, vq))
+   vhost_poll_queue(&vq->poll);
+   mutex_unlock(&vq->mutex);
+   }
+
+   return peek_head_len(sk);
+}
+
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
  * @vq - the relevant virtqueue
@@ -553,7 +615,7 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
-   while ((sock_len = peek_head_len(sock->sk))) {
+   while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
headcount = get_rx_bufs(vq, vq->heads, vhost_len,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b86c5aa..857af6c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -285,6 +285,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->memory = NULL;
vq->is_le = virtio_legacy_is_little_endian();
vhost_vq_reset_user_be(vq);
+   vq->busyloop_timeout = 0;
 }
 
 static int vhost_worker(void *data)
@@ -747,6 +748,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
	struct 

Re: [PATCH v1 1/6] net: Generalize udp based tunnel offload

2015-11-24 Thread Tom Herbert
>
> FWIW, I've brought the issue to the attention of the architects here,
> and we will likely be able to make changes in this space.  Intel
> hardware (as demonstrated by your patches) already is able to deal with
> this de-ossification on transmit.  Receive is a whole different beast.
>
Please provide the specifics on why "Receive is a whole different
beast." Generic receive checksum is already a subset of the
functionality that you must have implemented to support the protocol
specific offloads. All the hardware needs to do is calculate the 1's
complement checksum of the packet and return the value to the host
with that packet. That's it. No parsing of headers, no worrying
about the pseudo header, no dealing with any encapsulation. Just do
the calculation, return the result to the host and the driver converts
this to CHECKSUM_COMPLETE. I find it very hard to believe that this is
any harder than specific support for the next protocol du jour.

> I think that trying to force an agenda with no fore-warning and also
> punishing the users in order to get hardware vendors to change is the
> wrong way to go about this.  All you end up with is people just asking
> you why their hardware doesn't work in the kernel.
>
As you said, this is only feedback and nobody is forcing anyone to do
anything. But encouraging HW vendors to provide generic mechanisms so
that your users can use whatever protocol they want is the exact
_opposite_ of punishing users; this is very much a pro-user direction.

> You have a proposal, let's codify it and enable it for the future, and
> especially be *really* clear what you want hardware vendors to
> implement so that they get it right.  MS does this by publishing
> specifications and being clear what MUST be implemented and what COULD
> be implemented.
>

Linux does not mandate HW implementation the way MS does; what we do is
define driver interfaces which allow for a variety of different HW
implementations. The stack-driver checksum interface is described at
the top of skbuff.h. If this interface description is not clear enough
please let me know and we can fix that. If it is helpful we can
publish our requirements of new NICs at Facebook for reference.

Tom


Re: [PATCH] net/ipv4/ipconfig: Rejoin broken lines in console output

2015-11-24 Thread Joe Perches
On Tue, 2015-11-24 at 14:08 +0100, Geert Uytterhoeven wrote:
> Commit 09605cc12c078306 ("net ipv4: use preferred log methods") replaced
> a few calls of pr_cont() after a console print without a trailing
> newline by pr_info(), causing lines to be split during IP
> autoconfiguration.


Thanks Geert.


Re: [PATCH v1 1/6] net: Generalize udp based tunnel offload

2015-11-24 Thread Hannes Frederic Sowa
On Tue, Nov 24, 2015, at 18:32, Tom Herbert wrote:
> As you said, this is only feedback and nobody is forcing anyone to do
> anything. But encouraging HW vendors to provide generic mechanisms so
> that your users can use whatever protocol they want is the exact
> _opposite_ of punishing users; this is very much a pro-user direction.

Some users will suffer worse performance if we don't correctly set
ip_summed for a specific protocol before we do the copy operations from
user space into skbs, and the checksums instead always have to be
computed in the driver.

Bye,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Tom Herbert
On Tue, Nov 24, 2015 at 9:16 AM, Florian Westphal  wrote:
> Tom Herbert  wrote:
>> No one is being forced to use any of this.
>
> Right.  But it will need to be maintained.
> Lets ignore ktls for the time being and focus on KCM.
>
> I'm currently trying to figure out how memory handling in KCM
> is supposed to work.
>
> say we have following record framing:
>
> struct record {
> u32 len;
> char data[];
> };
>
> And I have a epbf filter that returns record->len within KCM.
> Now this program says 'length 128mbyte' (or whatever).
>
> If this was userspace, things are simple, userspace can either
> decide to hang up or start to read this in chunks as data arrives.
>
> AFAICS, with KCM, the kernel now has to keep 128MB of allocated
> memory around; rmem limits are ignored.
>
> Is that correct?  What if the next record claims 4GB in size?
> I don't really see how we can make any guarantees wrt.
> kernel stability...
>
> Am I missing something?

Message size limits can be enforced in BPF or we could add a limit
enforced by KCM. For instance, the message size limit in http/2 is
16M. If it's needed, it wouldn't be much trouble to add a streaming
interface for large messages.

>
> Thanks,
> Florian


Re: [PATCH v1 1/6] net: Generalize udp based tunnel offload

2015-11-24 Thread Tom Herbert
On Tue, Nov 24, 2015 at 9:43 AM, Hannes Frederic Sowa
 wrote:
> On Tue, Nov 24, 2015, at 18:32, Tom Herbert wrote:
>> As you said, this is only feedback and nobody is forcing anyone to do
>> anything. But encouraging HW vendors to provide generic mechanisms so
>> that your users can use whatever protocol they want is the exact
>> _opposite_ of punishing users; this is very much a pro-user direction.
>
> Some users will suffer worse performance if we don't correctly set
> ip_summed for a specific protocol before we do the copy operations from
> user space into skbs, and the checksums instead always have to be
> computed in the driver.
>
Please be specific. Who are the users, what is exact performance
regression, what are specific protocols in question?

> Bye,
> Hannes


  1   2   3   >