Re: [PATCH v2 net-next 0/3] ip: Use rb trees for IP frag queue

2018-08-03 Thread Josh Hunt
On Thu, Aug 2, 2018 at 4:34 PM, Peter Oskolkov  wrote:

> This patchset
>  * changes IPv4 defrag behavior to match that of IPv6: overlapping
>fragments now cause the whole IP datagram to be discarded (suggested
>by David Miller): there are no legitimate use cases for overlapping
>fragments;
>  * changes IPv4 defrag queue from a list to a rb tree (suggested
>by Eric Dumazet): this change removes a potential attach vector.
>
> Upcoming patches will contain similar changes for IPv6 frag queue,
> as well as a comprehensive IP defrag self-test (temporarily delayed).
>
> Peter Oskolkov (3):
>   ip: discard IPv4 datagrams with overlapping segments.
>   net: modify skb_rbtree_purge to return the truesize of all purged
> skbs.
>   ip: use rb trees for IP frag queue.
>
>  include/linux/skbuff.h  |  11 +-
>  include/net/inet_frag.h |   3 +-
>  include/uapi/linux/snmp.h   |   1 +
>  net/core/skbuff.c   |   6 +-
>  net/ipv4/inet_fragment.c|  16 +-
>  net/ipv4/ip_fragment.c  | 239 +++-
>  net/ipv4/proc.c |   1 +
>  net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
>  net/ipv6/reassembly.c   |   1 +
>  9 files changed, 139 insertions(+), 140 deletions(-)
>
> --
> 2.18.0.597.ga71716f1ad-goog
>
>
Peter

I just tested your patches along with Florian's on top of net-next. Things
look much better wrt this type of attack. Thanks for doing this. I'm
wondering if we want to put an optional mechanism in place to limit the
size of the tree in terms of skbs it can hold? Otherwise an attacker can
send ~1400 8 byte frags and consume all frag memory (default high thresh is
4M) pretty easily and I believe also evict other frags which may have been
pending? I am guessing this is what Florian's min MTU patches are trying to
help with.

-- 
Josh


Re: TCP many-connection regression between 4.7 and 4.13 kernels.

2018-01-22 Thread Josh Hunt
On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear  wrote:
> On 01/22/2018 10:16 AM, Eric Dumazet wrote:
>>
>> On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:
>>>
>>> My test case is to have 6 processes each create 5000 TCP IPv4 connections
>>> to each other
>>> on a system with 16GB RAM and send slow-speed data.  This works fine on a
>>> 4.7 kernel, but
>>> will not work at all on a 4.13.  The 4.13 first complains about running
>>> out of tcp memory,
>>> but even after forcing those values higher, the max connections we can
>>> get is around 15k.
>>>
>>> Both kernels have my out-of-tree patches applied, so it is possible it is
>>> my fault
>>> at this point.
>>>
>>> Any suggestions as to what this might be caused by, or if it is fixed in
>>> more recent kernels?
>>>
>>> I will start bisecting in the meantime...
>>>
>>
>> Hi Ben
>>
>> Unfortunately I have no idea.
>>
>> Are you using loopback flows, or have I misunderstood you ?
>>
>> How loopback connections can be slow-speed ?
>>
>
> I am sending to self, but over external network interfaces, by using
> routing tables and rules and such.
>
> On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
> connections.  In this case, I have a pair of 10G ports doing 15k, and then
> I try to start 5k on two of the 1G ports
>
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Down
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Down
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Down
> Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Down
> Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3
> (e1000e): transmit queue 0 timed out, trans_s...es: 1
> Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e :07:00.0
> eth3: Reset adapter unexpectedly
>

Ben

We had an interface doing this and grabbing these commits resolved it for us:

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
19110cfbb34d e1000e: Separate signaling for link check/link up
d3509f8bc7b0 e1000e: Fix return value test
65a29da1f5fd e1000e: Fix wrong comment related to link detection
c4c40e51f9c3 e1000e: Fix error path in link detection

They are in the LTS kernels now, but don't believe they were when we
first hit this problem.

Josh


[PATCH] net/sched: fix pointer check in gen_handle

2017-09-10 Thread Josh Hunt
Fixes sparse warning about pointer in gen_handle:
net/sched/cls_rsvp.h:392:40: warning: Using plain integer as NULL pointer

Fixes: 8113c095672f6 ("net_sched: use void pointer for filter handle")
Signed-off-by: Josh Hunt <joh...@akamai.com>
---
 net/sched/cls_rsvp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 98c05db85bcb..b1f6ed48bc72 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -389,7 +389,7 @@ static unsigned int gen_handle(struct tcf_proto *tp, 
unsigned salt)
if ((data->hgenerator += 0x1) == 0)
data->hgenerator = 0x1;
h = data->hgenerator|salt;
-   if (rsvp_get(tp, h) == 0)
+   if (!rsvp_get(tp, h))
return h;
}
return 0;
-- 
1.9.1



[PATCH] sock: introduce SO_MEMINFO getsockopt

2017-03-20 Thread Josh Hunt
Allows reading of SK_MEMINFO_VARS via socket option. This way an
application can get all meminfo related information in single socket
option call instead of multiple calls.

Adds helper function, sk_get_meminfo(), and uses that for both
getsockopt and sock_diag_put_meminfo().

Suggested by Eric Dumazet.

Signed-off-by: Josh Hunt <joh...@akamai.com>
Reviewed-by: Jason Baron <jba...@akamai.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 arch/alpha/include/uapi/asm/socket.h   |  2 ++
 arch/avr32/include/uapi/asm/socket.h   |  2 ++
 arch/frv/include/uapi/asm/socket.h |  2 ++
 arch/ia64/include/uapi/asm/socket.h|  2 ++
 arch/m32r/include/uapi/asm/socket.h|  2 ++
 arch/mips/include/uapi/asm/socket.h|  3 +++
 arch/mn10300/include/uapi/asm/socket.h |  2 ++
 arch/parisc/include/uapi/asm/socket.h  |  2 ++
 arch/powerpc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h|  2 ++
 arch/sparc/include/uapi/asm/socket.h   |  2 ++
 arch/xtensa/include/uapi/asm/socket.h  |  2 ++
 include/net/sock.h |  2 ++
 include/uapi/asm-generic/socket.h  |  2 ++
 net/core/sock.c| 30 ++
 net/core/sock_diag.c   | 10 +-
 16 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index afc901b..089db42 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -99,4 +99,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h 
b/arch/avr32/include/uapi/asm/socket.h
index 5a65042..6eabcbd 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index 81e0353..bd497f8 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -92,5 +92,7 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index 57feb0c..f1bb546 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index 5853f8e9..459c460 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 566ecdc..688c18d 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -110,4 +110,7 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index 0e12527..312d2c4 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 7a109b7..b98ec38 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -91,4 +91,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 0x402F
 
+#define SO_MEMINFO 0x4030
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h 
b/arch/powerpc/include/uapi/asm/socket.h
index 44583a5..099a889 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -99,4 +99,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_MEMINFO 55
+
 #endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index b24a64c..6199bb3 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -98,4 +98,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#defineSO_MEMINFO  55
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index a25dc32..12cd8c2 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -88,6 +88,8 @@
 
 #define SCM_TIME

Re: [PATCH] net: udp: add socket option to report RX queue level

2017-03-17 Thread Josh Hunt

On 03/17/2017 05:01 PM, Eric Dumazet wrote:

On Fri, 2017-03-17 at 14:13 -0700, Chris Kuiper wrote:

This adds a new socket option "SO_RXQ_ALLOC" that enables providing
the RX queue buffer allocation as ancillary data from the recvmsg()
system call. The value reported is a byte number and together with
the RX queue size (obtained via getsockopt(SO_RCVBUF) can be used to
calculate a percentage value on how full the socket buffer is.
---


Seems a lot of overhead, and only UDP would be supported.

I very much prefer Josh Hunt proposal
( https://patchwork.ozlabs.org/patch/738250/ )

Ie using a separate getsockopt() call instead of adding code to UDP fast
path ?



Yes, let me know and I can send a patch adding a sockopt to return all 
of the meminfo vars.


Re: [RFC PATCH] sock: add SO_RCVQUEUE_SIZE getsockopt

2017-03-14 Thread Josh Hunt

On 03/13/2017 07:10 PM, David Miller wrote:

From: Josh Hunt <joh...@akamai.com>
Date: Mon, 13 Mar 2017 18:34:41 -0500


In this particular case they really do want to know total # of bytes
in the receive queue, not the data bytes they can consume from an
application pov. The kernel currently only exposes this value through
netlink or /proc/net/udp from what I saw.


Can you explain in what way this is useful?

The difference between skb->len and skb->truesize is really kernel
internal implementation detail, and I'm trying to figure out why
this would be useful to an application.



First, it looks like my original patch was against an old kernel which 
did not have the updated udp accounting code. Not sure how that 
happened. Apologies for that. There's no need to add in the backlog, at 
least for udp now, sk_rmem_alloc is all that is needed for my case.


The application here is interested in monitoring the amount of data in 
the receive buffer. Looking for and identifying overflows, and also 
understanding how full it is. I know we already have SO_RXQ_OVFL, but 
this only shows the # of drops on overflow.


We expose this (skmem) information via /proc and netlink today. It seems 
like unnecessary overhead to require an application to also create a 
netlink socket to get this data.


Creating a socket option to mimic the behavior of 
sock_diag_put_meminfo() and export all meminfo_vars would be great if 
that's something you'd accept.


Josh


Re: [RFC PATCH] sock: add SO_RCVQUEUE_SIZE getsockopt

2017-03-13 Thread Josh Hunt

On 03/13/2017 02:39 PM, David Miller wrote:

From: Josh Hunt <joh...@akamai.com>
Date: Mon, 13 Mar 2017 12:38:39 -0500


On 03/13/2017 11:12 AM, Eric Dumazet wrote:

On Mon, Mar 13, 2017 at 8:59 AM, Josh Hunt <joh...@akamai.com> wrote:

Allows application to read the amount of data sitting in the receive
queue.

Signed-off-by: Josh Hunt <joh...@akamai.com>
---

A team here is looking for a way to get the amount of data in a UDP
socket's
receive queue. It seems like this should be SIOCINQ, but for UDP
sockets that
returns the size of the next pending datagram. I implemented the patch
below,
but am wondering if this is the right place for this change? I was
debating
between this or a new UDP ioctl.


But what is the 'amount of data' exactly ?
Number of packets, amount of bytes to read from these packets ?


I meant bytes. I will clarify in the next version.


As Eric is hinting, the calculation you are using doesn't represent
this.

You need to do something like walk the receive queue and add the
skb->len values together.

sk->sk_rmem_alloc is usually much larger than the sum of the skb->len
values in the socket receive queue.  I don't see how this culmination
of skb->truesize values is useful, whereas I can see how an application
could want the summation of the skb->len values.



In this particular case they really do want to know total # of bytes in 
the receive queue, not the data bytes they can consume from an 
application pov. The kernel currently only exposes this value through 
netlink or /proc/net/udp from what I saw.


I believe Eric's suggestion in his previous mail was to export all of 
these meminfo metrics via a single socket option call similar to how its 
done in netlink. We could then use that for both call sites.


I agree that it would be useful to also have the data you and Eric are 
suggesting exposed somewhere, the total # of skb->len bytes sitting in 
the receive queue. I could add that as a second socket option.


Does this sound reasonable?

Josh


Re: [RFC PATCH] sock: add SO_RCVQUEUE_SIZE getsockopt

2017-03-13 Thread Josh Hunt

On 03/13/2017 11:12 AM, Eric Dumazet wrote:

On Mon, Mar 13, 2017 at 8:59 AM, Josh Hunt <joh...@akamai.com> wrote:

Allows application to read the amount of data sitting in the receive
queue.

Signed-off-by: Josh Hunt <joh...@akamai.com>
---

A team here is looking for a way to get the amount of data in a UDP socket's
receive queue. It seems like this should be SIOCINQ, but for UDP sockets that
returns the size of the next pending datagram. I implemented the patch below,
but am wondering if this is the right place for this change? I was debating
between this or a new UDP ioctl.


But what is the 'amount of data' exactly ?
Number of packets, amount of bytes to read from these packets ?


I meant bytes. I will clarify in the next version.



You chose to report kernel memory usage, which is not guaranteed to be
the same among kernels versions (or kernel configs)

If we export these internals, I would export the whole thing, like we
did with netlink

ie tweak sock_diag_put_meminfo() and export the SK_MEMINFO_VARS

So that we avoid adding other options in the future.



OK, so you're suggesting we provide all of the SK_MEMINFO_VARS data via 
socket option then? That sounds good to me. I will send that in a v2.


Josh


[RFC PATCH] sock: add SO_RCVQUEUE_SIZE getsockopt

2017-03-13 Thread Josh Hunt
Allows application to read the amount of data sitting in the receive
queue.

Signed-off-by: Josh Hunt <joh...@akamai.com>
---

A team here is looking for a way to get the amount of data in a UDP socket's
receive queue. It seems like this should be SIOCINQ, but for UDP sockets that
returns the size of the next pending datagram. I implemented the patch below,
but am wondering if this is the right place for this change? I was debating
between this or a new UDP ioctl.

 include/net/sock.h| 7 ++-
 include/uapi/asm-generic/socket.h | 2 ++
 net/core/sock.c   | 4 
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5e59976..bcfed2ae 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -854,6 +854,11 @@ static inline void __sk_add_backlog(struct sock *sk, 
struct sk_buff *skb)
skb->next = NULL;
 }
 
+static unsigned int sk_rcvqueue_size(const struct sock *sk)
+{
+   return sk->sk_backlog.len + atomic_read(>sk_rmem_alloc);
+}
+
 /*
  * Take into account size of receive queue and backlog queue
  * Do not take into account this skb truesize,
@@ -861,7 +866,7 @@ static inline void __sk_add_backlog(struct sock *sk, struct 
sk_buff *skb)
  */
 static inline bool sk_rcvqueues_full(const struct sock *sk, unsigned int limit)
 {
-   unsigned int qsize = sk->sk_backlog.len + 
atomic_read(>sk_rmem_alloc);
+   unsigned int qsize = sk_rcvqueue_size(sk);
 
return qsize > limit;
 }
diff --git a/include/uapi/asm-generic/socket.h 
b/include/uapi/asm-generic/socket.h
index 2c748dd..36b8cd5 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -94,4 +94,6 @@
 
 #define SCM_TIMESTAMPING_OPT_STATS 54
 
+#define SO_RCVQUEUE_SIZE   55
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index f6fd79f..fa69864 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1269,6 +1269,10 @@ int sock_getsockopt(struct socket *sock, int level, int 
optname,
v.val = sk->sk_incoming_cpu;
break;
 
+   case SO_RCVQUEUE_SIZE:
+   v.val = sk_rcvqueue_size(sk);
+   break;
+
default:
/* We implement the SO_SNDLOWAT etc to not be settable
 * (1003.1g 7).
-- 
1.9.1



[RFC] TCP_NOTSENT_LOWAT behavior

2017-02-16 Thread Josh Hunt
Eric

A team here was using the TCP_NOTSENT_LOWAT socket option and noticed that
more unsent data than they were expecting was sitting in the write queue. I
took a look and noticed that while we don't allow allocation of new skbs once
we exceed this value, we still allow adding data to the skb at the tail of the
write queue. In this context that means we could add up to size_goal to the
skb, which could be up to 64kb.

The patch below attempts to put a cap on the amount we allow to write over
the TCP_NOTSENT_LOWAT value at 50%. In cases where the setting is smaller this
will allow the # of unsent bytes to more closely reflect the value. In cases
where the setting is 128kb or higher this will have no impact compared to the
current behavior. This should have two benefits: 1) finer-grain control of the
amount of unsent data, 2) reduction of TCP memory for values of 
TCP_NOTSENT_LOWAT
< 128k.

I reran the netperf results from your original commit with and without my patch:

4.10.0-rc8:
# echo $(( 128 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
# (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP 
/proc/net/protocols
TCPv6 2064  2   21735   no 208   yes  ipv6y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y
TCP   1912465   21735   no 208   yes  kernel  y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y

# echo $(( 64 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
# (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP 
/proc/net/protocols
TCPv6 2064  2   19859   no 208   yes  ipv6y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y
TCP   1912465   19859   no 208   yes  kernel  y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y

4.10.0-rc8 + patch:
# echo $(( 128 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
# (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP 
/proc/net/protocols
TCPv6 2064  2   21570   no 208   yes  ipv6y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y
TCP   1912465   21570   no 208   yes  kernel  y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y

# echo $(( 64 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
# (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP 
/proc/net/protocols
TCPv6 2064  2   18257   no 208   yes  ipv6y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y
TCP   1912465   18257   no 208   yes  kernel  y  y  y  y  y  y  
y  y  y  y  y  y  y  n  y  y  y  y  y

I still need to do more testing, but wanted to get feedback on the idea.

Josh

---
In tcp_sendmsg() we check to see if we are under the TCP_NOTSENT_LOWAT
threshold before allocating a new skb. However we do no checks when adding
page frags to an existing skb. On systems with large size_goals we can
wind up adding up to 64k to the send queue over the TCP_NOTSENT_LOWAT
setting.

This patch adds a cap on the amount of data we can add to the send queue
at 50% above the TCP_NOTSENT_LOWAT setting.

Signed-off-by: Josh Hunt <joh...@akamai.com>
---
 include/net/tcp.h |  7 ++-
 net/ipv4/tcp.c| 14 ++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6061963..a2cce3c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1773,10 +1773,15 @@ static inline u32 tcp_notsent_lowat(const struct 
tcp_sock *tp)
return tp->notsent_lowat ?: net->ipv4.sysctl_tcp_notsent_lowat;
 }
 
+static inline u32 tcp_notsent_bytes(const struct tcp_sock *tp)
+{
+   return tp->write_seq - tp->snd_nxt;
+}
+
 static inline bool tcp_stream_memory_free(const struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
-   u32 notsent_bytes = tp->write_seq - tp->snd_nxt;
+   u32 notsent_bytes = tcp_notsent_bytes(tp);
 
return notsent_bytes < tcp_notsent_lowat(tp);
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0efb4c7..c1b5e08 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1178,6 +1178,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
while (msg_data_left(msg)) {
int copy = 0;
int max = size_goal;
+   int notsent_lowat;
 
skb = tcp_write_queue_tail(sk);
if (tcp_send_head(sk)) {
@@ -1227,6 +1228,19 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
TCP_SKB_CB(skb)->sacked |= TCPCB_REPAIRED;
}
 
+   notsent_lowat = tcp_notsent_lowat(tp);
+   if (notsent_lowat != -1) {
+   /* Cap unsent bytes at no more than 50% above 
TCP_NOTSENT_LOWAT value */
+   notsent_lowat += (notsent_lowat >> 1) - 
tcp_notsent_bytes(tp);
+   

Re: [PATCH net] tcp: fastopen: avoid negative sk_forward_alloc

2016-09-07 Thread Josh Hunt

On 09/07/2016 10:34 AM, Eric Dumazet wrote:

From: Eric Dumazet <eduma...@google.com>

When DATA and/or FIN are carried in a SYN/ACK message or SYN message,
we append an skb in socket receive queue, but we forget to call
sk_forced_mem_schedule().

Effect is that the socket has a negative sk->sk_forward_alloc as long as
the message is not read by the application.

Josh Hunt fixed a similar issue in commit d22e15371811 ("tcp: fix tcp
fin memory accounting")

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
  net/ipv4/tcp_fastopen.c |1 +
  1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 62a5751d4fe1..4e777a3243f9 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -150,6 +150,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff 
*skb)
tp->segs_in = 0;
tcp_segs_in(tp, skb);
__skb_pull(skb, tcp_hdrlen(skb));
+   sk_forced_mem_schedule(sk, skb->truesize);
skb_set_owner_r(skb, sk);

TCP_SKB_CB(skb)->seq++;




The change makes sense to me. Thanks Eric!

Reviewed-by: Josh Hunt <joh...@akamai.com>


Re: Poorer networking performance in later kernels?

2016-04-19 Thread Josh Hunt
On Tue, Apr 19, 2016 at 9:54 AM, Butler, Peter  wrote:
>> -Original Message-
>> From: Rick Jones [mailto:rick.jon...@hpe.com]
>> Sent: April-15-16 6:37 PM
>> To: Butler, Peter ; netdev@vger.kernel.org
>> Subject: Re: Poorer networking performance in later kernels?
>>
>> On 04/15/2016 02:02 PM, Butler, Peter wrote:
>>> (Please keep me CC'd to all comments/responses)
>>>
>>> I've tried a kernel upgrade from 3.4.2 to 4.4.0 and see a marked drop
>>> in networking performance.  Nothing was changed on the test systems,
>>> other than the kernel itself (and kernel modules).  The identical
>>> .config used to build the 3.4.2 kernel was brought over into the
>>> 4.4.0 kernel source tree, and any configuration differences (e.g. new
>>> parameters, etc.) were taken as default values.
>>>
>>> The testing was performed on the same actual hardware for both kernel
>>> versions (i.e. take the existing 3.4.2 physical setup, simply boot
>>> into the (new) kernel and run the same test).  The netperf utility
>>> was used for benchmarking and the testing was always performed on
>>> idle systems.
>>>
>>> TCP testing yielded the following results, where the 4.4.0 kernel
>>> only got about 1/2 of the throughput:
>>>
>>
>>> Recv Send   Send  Utilization   
>>> Service Demand
>>> Socket   Socket Message Elapsed   Send Recv 
>>> SendRecv
>>> Size Size   SizeTime   Throughput localremote   
>>> local   remote
>>> bytesbytes  bytes   secs.  10^6bits/s % S  % S  
>>> us/KB   us/KB
>>>
>>> 3.4.2 13631488 13631488   895230.01  9370.2910.146.50 
>>> 0.709   0.454
>>> 4.4.0 13631488 13631488   895230.02  5314.039.14 14.31
>>> 1.127   1.765
>>>
>>> SCTP testing yielded the following results, where the 4.4.0 kernel only got 
>>> about 1/3 of the throughput:
>>>
>>> Recv Send   Send  Utilization   
>>> Service Demand
>>> Socket   Socket Message Elapsed   Send Recv 
>>> SendRecv
>>> Size Size   SizeTime   Throughput localremote   
>>> local   remote
>>> bytesbytes  bytes   secs.  10^6bits/s  % S % S  
>>> us/KB   us/KB
>>>
>>> 3.4.2 13631488 13631488   895230.00  2306.2213.8713.19
>>> 3.941   3.747
>>> 4.4.0 13631488 13631488   895230.01   882.7416.8619.14
>>> 12.516  14.210
>>>
>>> The same tests were performed a multitude of time, and are always
>>> consistent (within a few percent).  I've also tried playing with
>>> various run-time kernel parameters (/proc/sys/kernel/net/...) on the
>>> 4.4.0 kernel to alleviate the issue but have had no success at all.
>>>
>>> I'm at a loss as to what could possibly account for such a discrepancy...
>>>
>>
>> I suspect I am not alone in being curious about the CPU(s) present in the 
>> systems and the model/whatnot of the NIC being used.  I'm also curious as to 
>> why you have what at first glance seem like absurdly large socket buffer 
>> sizes.
>>
>> That said, it looks like you have some Really Big (tm) increases in service 
>> demand.  Many more CPU cycles being consumed per KB of data transferred.
>>
>> Your message size makes me wonder if you were using a 9000 byte MTU.
>>
>> Perhaps in the move from 3.4.2 to 4.4.0 you lost some or all of the 
>> stateless offloads for your NIC(s)?  Running ethtool -k  on both 
>> ends under both kernels might be good.
>>
>> Also, if you did have a 9000 byte MTU under 3.4.2 are you certain you still 
>> had it under 4.4.0?
>>
>> It would (at least to me) also be interesting to run a TCP_RR test comparing 
>> the two kernels.  TCP_RR (at least with the default request/response size of 
>> one byte) doesn't really care about stateless offloads or MTUs and could 
>> show how much difference there is in basic path length (or I suppose in 
>> interrupt coalescing behaviour if the NIC in question has a mildly dodgy 
>> heuristic for such things).
>>
>> happy benchmarking,
>>
>> rick jones
>>
>
>
> I think the issue is resolved.  I had to recompile my 4.4.0 kernel with a few 
> options pertaining to the Intel NIC which somehow (?) got left out or 
> otherwise clobbered when I ported my 3.4.2 .config to the 4.4.0 kernel source 
> tree.  With those changes now in I see essentially identical performance with 
> the two kernels.  Sorry for any confusion and/or waste of time here.  My bad.
>
>

Can you share which config options you enabled to get your performance back?

-- 
Josh


Re: [PATCH] udp6: fix UDP/IPv6 encap resubmit path

2016-03-07 Thread Josh Hunt

On 03/04/2016 04:47 PM, Bill Sommerfeld wrote:

IPv4 interprets a negative return value from a protocol handler as a
request to redispatch to a new protocol.  In contrast, IPv6 interprets a
negative value as an error, and interprets a positive value as a request
for redispatch.

UDP for IPv6 was unaware of this difference.  Change __udp6_lib_rcv() to
return a positive value for redispatch.  Note that the socket's
encap_rcv hook still needs to return a negative value to request
dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to
identify the byte containing the next protocol.

Signed-off-by: Bill Sommerfeld 
---
  net/ipv6/udp.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 0711f8f..fd25e44 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -922,11 +922,9 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
ret = udpv6_queue_rcv_skb(sk, skb);
sock_put(sk);

-   /* a return value > 0 means to resubmit the input, but
-* it wants the return to be -protocol, or 0
-*/
+   /* a return value > 0 means to resubmit the input */
if (ret > 0)
-   return -ret;
+   return ret;

return 0;
}



This looks good to me. Thanks Bill!

Josh


Re: [PATCH stable <= 3.18] net: add length argument to skb_copy_and_csum_datagram_iovec

2015-11-10 Thread Josh Hunt
On Thu, Oct 29, 2015 at 5:00 AM, Sabrina Dubroca  wrote:
> 2015-10-15, 14:25:03 +0200, Sabrina Dubroca wrote:
>> Without this length argument, we can read past the end of the iovec in
>> memcpy_toiovec because we have no way of knowing the total length of the
>> iovec's buffers.
>>
>> This is needed for stable kernels where 89c22d8c3b27 ("net: Fix skb
>> csum races when peeking") has been backported but that don't have the
>> ioviter conversion, which is almost all the stable trees <= 3.18.
>>
>> This also fixes a kernel crash for NFS servers when the client uses
>>  -onfsvers=3,proto=udp to mount the export.
>>
>> Signed-off-by: Sabrina Dubroca 
>> Reviewed-by: Hannes Frederic Sowa 
>
> Fixes CVE-2015-8019.
> http://www.openwall.com/lists/oss-security/2015/10/29/1
>
> --
> Sabrina
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Greg

Do you have this in your queue? I saw a few other stables pick this
up, but haven't seen it in 3.14 or 3.18 yet. It wasn't clear to me if
this had been fully reviewed yet.

Thanks
-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] show socket info in fdinfo

2015-07-15 Thread Josh Hunt
When sockets are unhashed they are no longer visible through the normal
methods. Provide a way to see all sockets in the system by dumping some
basic socket information in fdinfo.

Signed-off-by: Josh Hunt joh...@akamai.com
---

When trying to debug an application issue which wound up not closing sockets
properly we found there was no easy way through existing tools to show sockets
which have been unhashed. In order to get some basic information about all
allocated sockets on the system we'd like to dump this info into fdinfo.

 net/socket.c |   52 
 1 file changed, 52 insertions(+)

diff --git a/net/socket.c b/net/socket.c
index 9963a0b..eced913 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -131,6 +131,9 @@ static ssize_t sock_sendpage(struct file *file, struct page 
*page,
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
+#ifdef CONFIG_PROC_FS
+static void sock_show_fdinfo(struct seq_file *m, struct file *file);
+#endif
 
 /*
  * Socket files have a set of 'special' operations as well as the generic 
file ones. These don't appear
@@ -153,6 +156,9 @@ static const struct file_operations socket_file_ops = {
.sendpage = sock_sendpage,
.splice_write = generic_splice_sendpage,
.splice_read =  sock_splice_read,
+#ifdef CONFIG_PROC_FS
+   .show_fdinfo =  sock_show_fdinfo,
+#endif
 };
 
 /*
@@ -777,6 +783,52 @@ static ssize_t sock_splice_read(struct file *file, loff_t 
*ppos,
return sock-ops-splice_read(sock, ppos, pipe, len, flags);
 }
 
+#ifdef CONFIG_PROC_FS
+static void sock_show_fdinfo(struct seq_file *m, struct file *file)
+{
+   struct socket *sock = NULL;
+   struct sock *sk = NULL;
+   int err;
+
+   sock = sock_from_file(file, err);
+   if (sock) {
+   sk = sock-sk;
+   sock_hold(sk);
+   seq_printf(m, state:\t%d\n
+   type:\t0x%x\n
+   flags\t%lu\n
+   sk_rcvbuf:\t%d\n
+   sk_sndbuf:\t%d\n
+   sk_forward_alloc:\t%d\n
+   sk_rmem_alloc:\t%d\n
+   sk_wmem_alloc:\t%d\n
+   sk_wmem_queued:\t%d\n
+   sk_err:\t%d\n
+   sk_state:\t0x%x\n
+   sk_flags:\t0x%lx\n
+   sk_unhashed:\t%d\n
+   sk_protocol:\t0x%x\n
+   sk_family:\t0x%x\n,
+   sock-state,
+   sock-type,
+   sock-flags,
+   sk-sk_rcvbuf,
+   sk-sk_sndbuf,
+   sk-sk_forward_alloc,
+   atomic_read(sk-sk_rmem_alloc),
+   atomic_read(sk-sk_wmem_alloc),
+   sk-sk_wmem_queued,
+   sk-sk_err,
+   sk-sk_state,
+   sk-sk_flags,
+   sk_unhashed(sk),
+   sk-sk_protocol,
+   sk-sk_family);
+   sock_put(sk);
+   }
+}
+#endif
+
 static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
struct file *file = iocb-ki_filp;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] ipv6: Fix protocol resubmission

2015-06-11 Thread Josh Hunt

On 06/11/2015 08:05 AM, Hajime Tazaki wrote:


At Wed, 10 Jun 2015 16:57:25 -0500,
Josh Hunt wrote:


Dave

Can you please revert this change?

commit 0243508edd317ff1fa63b495643a7c192fbfcd92
Author: Josh Hunt joh...@akamai.com
Date:   Mon Jun 8 12:00:59 2015 -0400

  ipv6: Fix protocol resubmission

Let me know if you need a patch to do this and I will submit something.

I will fix the original issue in the UDP code in another patch.


feel free to Cc me if you would like me to test the new patch.


I will definitely take you up on that :)

Thanks
Josh
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] ipv6: Fix protocol resubmission

2015-06-10 Thread Josh Hunt

On 06/09/2015 11:24 PM, Hajime Tazaki wrote:


Hello Josh, Dave,

my mobile ipv6 test on libos failed with this commit.

This commit makes a destination option header handling (i.e.,
ipprot-handler == ipv6_destopt_rcv) failed since
ipv6_destopt_rcv() seems to return a positive value to
indicate to goto resubmission label.

I will look for more detail.

-- Hajime


Hajime

Thanks for the report. I mentioned in an earlier post this might be a 
problem.


Dave, what if we restore the old behavior, but add a new label to handle 
the case where the decapsulating protocol returns the nexthdr value? 
Allowing for migration over to this method over time. I've pasted in a 
patch doing so below.


The other solution I guess is to change how the udp handler works, but I 
was hoping to keep it behaving the same as v4.


Josh

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 41a73da..a4fab24 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -212,13 +212,13 @@ static int ip6_input_finish(struct sock *sk, 
struct sk_buff *skb)

 */

rcu_read_lock();
+resubmit:
idev = ip6_dst_idev(skb_dst(skb));
if (!pskb_pull(skb, skb_transport_offset(skb)))
goto discard;
nhoff = IP6CB(skb)-nhoff;
nexthdr = skb_network_header(skb)[nhoff];
-
-resubmit:
+resubmit_nexthdr:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -246,9 +246,11 @@ resubmit:
goto discard;

ret = ipprot-handler(skb);
-   if (ret  0) {
-   nexthdr = -ret;
+   if (ret  0) {
goto resubmit;
+   } else if (ret  0) {
+   nexthdr = -ret;
+   goto resubmit_nexthdr;
} else if (ret == 0) {
IP6_INC_STATS_BH(net, idev, 
IPSTATS_MIB_INDELIVERS);

}
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] ipv6: Fix protocol resubmission

2015-06-10 Thread Josh Hunt


On 06/10/2015 10:23 AM, Josh Hunt wrote:

On 06/10/2015 10:16 AM, YOSHIFUJI Hideaki wrote:

Hi,

Josh Hunt wrote:


Hajime

Thanks for the report. I mentioned in an earlier post this might be a
problem.

Dave, what if we restore the old behavior, but add a new label to
handle the case where the decapsulating protocol returns the nexthdr
value? Allowing for migration over to this method over time. I've
pasted in a patch doing so below.


I think it is insufficient because IPv6 stack already uses
positive value, 0 and negative values.



Where does it use a negative value?

ret is only checked to be  or == to 0. I don't see any checks or code
handling a return value of  0 prior to my patch.

If something does return a negative value should it since nothing
happens with it?


Looking at the code again, I guess we don't increment
IPSTATS_MIB_INDELIVERS if we get a negative return value prior to my patch.

If so, then I guess we have to fix this in the udp path like you were 
suggesting. I will look at this and propose a patch to revert my 
original change and update the v6 udp code.


Josh
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] ipv6: Fix protocol resubmission

2015-06-10 Thread Josh Hunt

On 06/10/2015 10:16 AM, YOSHIFUJI Hideaki wrote:

Hi,

Josh Hunt wrote:

On 06/09/2015 11:24 PM, Hajime Tazaki wrote:


Hello Josh, Dave,

my mobile ipv6 test on libos failed with this commit.

This commit makes a destination option header handling (i.e.,
ipprot-handler == ipv6_destopt_rcv) failed since
ipv6_destopt_rcv() seems to return a positive value to
indicate to goto resubmission label.

I will look for more detail.

-- Hajime


Hajime

Thanks for the report. I mentioned in an earlier post this might be a problem.

Dave, what if we restore the old behavior, but add a new label to handle the 
case where the decapsulating protocol returns the nexthdr value? Allowing for 
migration over to this method over time. I've pasted in a patch doing so below.


I think it is insufficient because IPv6 stack already uses
positive value, 0 and negative values.



The other solution I guess is to change how the udp handler works, but I was 
hoping to keep it behaving the same as v4.


xfrm returns different value for IPv4 and IPv6, for example,
so udp can do in the same way.  And we can use the fact that
the size of next header field is 8-bit.



Dave

Can you please revert this change?

commit 0243508edd317ff1fa63b495643a7c192fbfcd92
Author: Josh Hunt joh...@akamai.com
Date:   Mon Jun 8 12:00:59 2015 -0400

ipv6: Fix protocol resubmission

Let me know if you need a patch to do this and I will submit something.

I will fix the original issue in the UDP code in another patch.

Josh

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] ipv6: Fix protocol resubmission

2015-06-10 Thread Josh Hunt

On 06/10/2015 10:16 AM, YOSHIFUJI Hideaki wrote:

Hi,

Josh Hunt wrote:


Hajime

Thanks for the report. I mentioned in an earlier post this might be a problem.

Dave, what if we restore the old behavior, but add a new label to handle the 
case where the decapsulating protocol returns the nexthdr value? Allowing for 
migration over to this method over time. I've pasted in a patch doing so below.


I think it is insufficient because IPv6 stack already uses
positive value, 0 and negative values.



Where does it use a negative value?

ret is only checked to be  or == to 0. I don't see any checks or code 
handling a return value of  0 prior to my patch.


If something does return a negative value should it since nothing 
happens with it?


Josh
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] ipv6: Fix protocol resubmission

2015-06-08 Thread Josh Hunt
UDP encapsulation is broken on IPv6. This is because the logic to resubmit
the nexthdr is inverted, checking for a ret value  0 instead of  0. Also,
the resubmit label is in the wrong position since we already get the
nexthdr value when performing decapsulation. In addition the skb pull is no
longer necessary either.

This changes the return value check to look for  0, using it for the
nexthdr on the next iteration, and moves the resubmit label to the proper
location.

With these changes the v6 code now matches what we do in the v4 ip input
code wrt resubmitting when decapsulating.

Signed-off-by: Josh Hunt joh...@akamai.com
---

v3: Fix formatting issues found by Sergei Shtylyov
v2: Use returned nexthdr value

 net/ipv6/ip6_input.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f2e464e..41a73da 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -212,13 +212,13 @@ static int ip6_input_finish(struct sock *sk, struct 
sk_buff *skb)
 */
 
rcu_read_lock();
-resubmit:
idev = ip6_dst_idev(skb_dst(skb));
if (!pskb_pull(skb, skb_transport_offset(skb)))
goto discard;
nhoff = IP6CB(skb)-nhoff;
nexthdr = skb_network_header(skb)[nhoff];
 
+resubmit:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -246,10 +246,12 @@ resubmit:
goto discard;
 
ret = ipprot-handler(skb);
-   if (ret  0)
+   if (ret  0) {
+   nexthdr = -ret;
goto resubmit;
-   else if (ret == 0)
+   } else if (ret == 0) {
IP6_INC_STATS_BH(net, idev, IPSTATS_MIB_INDELIVERS);
+   }
} else {
if (!raw) {
if (xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] ipv6: Fix protocol resubmission

2015-06-02 Thread Josh Hunt

On 05/29/2015 09:37 PM, Josh Hunt wrote:

On 05/29/2015 04:48 PM, Tom Herbert wrote:

Hi Josh,

Why did you need to move the resubmit label?


Grabbing nhoff out of the skb's cb didn't seem relevant anymore, unless
we're requiring the decapsulating code to update the control block
before it returns. Also, since we are returning nexthdr with the return
value it seems unnecessary to go through that work when we already have it.

Not that there has to be symmetry here, but for example, the v4 code
doesn't look up the protocol field again in the inner ip header. It just
uses the protocol value returned from the protocol handler.

Also, I'm skipping the pskb_pull(), which I assumed was left up to the
decapsulating code to setup the data pointer properly before returning.
Again, this is how the v4 code behaves.

The only thing left is the idev which I'm not sure about. Could that
change b/t calls?


Tom

Do you think the above is incorrect and we should just leave the 
resubmit label where it currently is? If so, do you think we should just 
bypass the nhoff/nexthdr checks and use the nexthdr value returned by 
the protocol handler?


Thanks
Josh

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] ipv6: Fix protocol resubmission

2015-05-29 Thread Josh Hunt

On 05/29/2015 04:48 PM, Tom Herbert wrote:

Hi Josh,

Why did you need to move the resubmit label?


Grabbing nhoff out of the skb's cb didn't seem relevant anymore, unless 
we're requiring the decapsulating code to update the control block 
before it returns. Also, since we are returning nexthdr with the return 
value it seems unnecessary to go through that work when we already have it.


Not that there has to be symmetry here, but for example, the v4 code 
doesn't look up the protocol field again in the inner ip header. It just 
uses the protocol value returned from the protocol handler.


Also, I'm skipping the pskb_pull(), which I assumed was left up to the 
decapsulating code to setup the data pointer properly before returning. 
Again, this is how the v4 code behaves.


The only thing left is the idev which I'm not sure about. Could that 
change b/t calls?


Josh






--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ipv6: Fix protocol resubmission

2015-05-29 Thread Josh Hunt
I came across this problem while trying to use UDP encapsulation with IPv6. The
change below fixes that, but it was not immediately apparent if there are any
other protocols relying on this broken behavior. FWIW the behavior below now
matches IPv4.

Josh
---

UDP encapsulation is broken on IPv6 because it expects when it returns a 
negative
value that the packet will be resubmitted to the stack with the handler 
corresponding
to the return value. The check currently looks for return values  0 and then 
resubmits.

This patch fixes that check and also moves the resubmit label to take advantage 
of
the return code identifying the next protocol we want to process.

Signed-off-by: Josh Hunt joh...@akamai.com
---
 net/ipv6/ip6_input.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f2e464e..a4c1026 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -212,13 +212,13 @@ static int ip6_input_finish(struct sock *sk, struct 
sk_buff *skb)
 */
 
rcu_read_lock();
-resubmit:
idev = ip6_dst_idev(skb_dst(skb));
if (!pskb_pull(skb, skb_transport_offset(skb)))
goto discard;
nhoff = IP6CB(skb)-nhoff;
nexthdr = skb_network_header(skb)[nhoff];
 
+resubmit:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -246,7 +246,7 @@ resubmit:
goto discard;
 
ret = ipprot-handler(skb);
-   if (ret  0)
+   if (ret  0)
goto resubmit;
else if (ret == 0)
IP6_INC_STATS_BH(net, idev, IPSTATS_MIB_INDELIVERS);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] ipv6: Fix protocol resubmission

2015-05-29 Thread Josh Hunt
I came across this problem while trying to use UDP encapsulation with IPv6. The
change below fixes that, but it was not immediately apparent if there are any
other protocols relying on this broken behavior. FWIW the behavior below now
matches IPv4.

Josh

v2: Actually sets nexthdr so we can use it ;)

---

UDP encapsulation is broken on IPv6 because it expects when it returns a 
negative
value that the packet will be resubmitted to the stack with the handler 
corresponding
to the return value. The check currently looks for return values  0 and then 
resubmits.

This patch fixes that check and also moves the resubmit label to take advantage 
of
the return code identifying the next protocol we want to process.

Signed-off-by: Josh Hunt joh...@akamai.com
---
 net/ipv6/ip6_input.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f2e464e..e16c289 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -212,13 +212,13 @@ static int ip6_input_finish(struct sock *sk, struct 
sk_buff *skb)
 */
 
rcu_read_lock();
-resubmit:
idev = ip6_dst_idev(skb_dst(skb));
if (!pskb_pull(skb, skb_transport_offset(skb)))
goto discard;
nhoff = IP6CB(skb)-nhoff;
nexthdr = skb_network_header(skb)[nhoff];
 
+resubmit:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -246,9 +246,10 @@ resubmit:
goto discard;
 
ret = ipprot-handler(skb);
-   if (ret  0)
+   if (ret  0) {
+   nexthdr = -ret;
goto resubmit;
-   else if (ret == 0)
+   } else if (ret == 0)
IP6_INC_STATS_BH(net, idev, IPSTATS_MIB_INDELIVERS);
} else {
if (!raw) {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ipv6: Fix protocol resubmission

2015-05-29 Thread Josh Hunt

On 05/29/2015 02:56 PM, Josh Hunt wrote:

I came across this problem while trying to use UDP encapsulation with IPv6. The
change below fixes that, but it was not immediately apparent if there are any
other protocols relying on this broken behavior. FWIW the behavior below now
matches IPv4.

Josh
---

UDP encapsulation is broken on IPv6 because it expects when it returns a 
negative
value that the packet will be resubmitted to the stack with the handler 
corresponding
to the return value. The check currently looks for return values  0 and then 
resubmits.

This patch fixes that check and also moves the resubmit label to take advantage 
of
the return code identifying the next protocol we want to process.

Signed-off-by: Josh Hunt joh...@akamai.com
---
  net/ipv6/ip6_input.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f2e464e..a4c1026 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -212,13 +212,13 @@ static int ip6_input_finish(struct sock *sk, struct 
sk_buff *skb)
 */

rcu_read_lock();
-resubmit:
idev = ip6_dst_idev(skb_dst(skb));
if (!pskb_pull(skb, skb_transport_offset(skb)))
goto discard;
nhoff = IP6CB(skb)-nhoff;
nexthdr = skb_network_header(skb)[nhoff];

+resubmit:
raw = raw6_local_deliver(skb, nexthdr);
ipprot = rcu_dereference(inet6_protos[nexthdr]);
if (ipprot) {
@@ -246,7 +246,7 @@ resubmit:
goto discard;

ret = ipprot-handler(skb);
-   if (ret  0)
+   if (ret  0)
goto resubmit;
else if (ret == 0)
IP6_INC_STATS_BH(net, idev, IPSTATS_MIB_INDELIVERS);



Please ignore, sending v2.

Josh
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] ipv6: Fix protocol resubmission

2015-05-29 Thread Josh Hunt

On 05/29/2015 03:27 PM, Sergei Shtylyov wrote:

Hello.

On 05/29/2015 11:04 PM, Josh Hunt wrote:


I came across this problem while trying to use UDP encapsulation with
IPv6. The
change below fixes that, but it was not immediately apparent if there
are any
other protocols relying on this broken behavior. FWIW the behavior
below now
matches IPv4.



Josh


Need Signed-off-by: line insted of this.


v2: Actually sets nexthdr so we can use it ;)


Normally goes after ---, not before.


---



UDP encapsulation is broken on IPv6 because it expects when it returns
a negative
value that the packet will be resubmitted to the stack with the
handler corresponding
to the return value. The check currently looks for return values  0
and then resubmits.



This patch fixes that check and also moves the resubmit label to take
advantage of
the return code identifying the next protocol we want to process.



Signed-off-by: Josh Hunt joh...@akamai.com
---
  net/ipv6/ip6_input.c |7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)



diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index f2e464e..e16c289 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c

[...]

@@ -246,9 +246,10 @@ resubmit:
  goto discard;

  ret = ipprot-handler(skb);
-if (ret  0)
+if (ret  0) {
+nexthdr = -ret;
  goto resubmit;
-else if (ret == 0)
+} else if (ret == 0)
  IP6_INC_STATS_BH(net, idev, IPSTATS_MIB_INDELIVERS);


The CodingStyle dictates that all branches must have {} if at least
one has them.


Thanks Sergei, I appreciate the review. I will make these changes in a 
v3 after I get feedback this is the correct fix for the problem.


Josh
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix tcp fin memory accounting

2015-04-23 Thread Josh Hunt

On 04/21/2015 07:09 PM, Eric Dumazet wrote:


Note that this patch adds a deadlock possibility in some stress
situations.

If a process owning some tcp socket dies, and tcp_mem[2] is already hit,
all sk_stream_alloc_skb() can return NULL and we loop in tcp_send_fin(),
making no progress because we can not free any tcp memory.


Ugh. Thanks for fixing this Eric!

Josh
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html