Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-05 Thread Eric Dumazet
 Looks like routing by definition can not divert skbs with
 early-demux socket because input routing is not called.

Only if the found socket has a valid sk->sk_rx_dst.

Early demux:

1) If the TCP lookup finds a matching socket, we do the attachment:
   skb->sk = sk;
   skb->destructor = sock_edemux;

2) If sk->sk_rx_dst is set and still valid, IP routing will use this cached dst.

So it looks very possible that some packets could match a socket but
fail the 2) phase.
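
The two phases above can be modelled in userspace. This is only a minimal sketch, not kernel code: the struct layouts, the `valid` flag and every `*_model` name are hypothetical stand-ins, and only `skb->sk`, `skb->destructor` and `sk_rx_dst` are taken from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; only the fields named in the
 * thread are modelled. */
struct dst_entry { bool valid; };

struct sock { struct dst_entry *sk_rx_dst; };

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
	struct dst_entry *dst;
};

/* Stand-in for sock_edemux(); the real one drops a socket reference. */
static void sock_edemux_model(struct sk_buff *skb)
{
	(void)skb;
}

/*
 * Returns true when the cached dst was attached (input routing can be
 * skipped), false when the packet still needs a normal route lookup.
 * skb->sk stays set in both cases once a socket was found.
 */
static bool early_demux_model(struct sk_buff *skb, struct sock *found)
{
	assert(skb != NULL);

	if (found == NULL)
		return false;

	/* Phase 1: attach the socket. */
	skb->sk = found;
	skb->destructor = sock_edemux_model;

	/* Phase 2: reuse the cached dst only if it is set and still valid. */
	if (found->sk_rx_dst != NULL && found->sk_rx_dst->valid) {
		skb->dst = found->sk_rx_dst;
		return true;
	}
	return false;
}
```

With a stale `sk_rx_dst` the function returns false while `skb->sk` remains attached, which is the "matched a socket but failed phase 2" situation.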
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-05 Thread Julian Anastasov

Hello,

On Fri, 3 Jul 2015, Alex Gartrell wrote:

  - if packets go to local server IPVS should not touch
  skb->dst, skb->sk, etc (NF_ACCEPT case)
 
 Yeah, the thing is that early demux could totally match for a socket
 that existed before we created the service, and in that instance it
 might make the most sense to retain the connection and simply
 NF_ACCEPT.  The problem with that approach though is that the
 behavior changes if early_demux is not enabled.  I believe that we
 should just do the consistent thing and always drop the early_demux
 result if bound for non-local, as you've said.

We must not forget that a local server listening
on 0.0.0.0:VPORT or VIP:VPORT can be reached if a real
server with some local IP is used as RIP. So, early demux
will really work for this case when local stack is one
of the real servers.

 The interesting thing though is that, for the purposes of routing,
 enabling early_demux does change the behavior.  I suspect that's a
 bug, but it's far enough away from actual use cases that it's probably
 fine (who is out there tearing down addresses and setting up routes in
 their place?)

Looks like routing by definition can not divert skbs with
early-demux socket because input routing is not called.
Netfilter's DNAT may change daddr/dport before early-demux
and in this case the socket should not be found (e.g. if we
DNAT to another host). So, there is a problem mostly for IPVS;
I don't remember other cases. Maybe CLUSTERIP too,
I'm not sure. There is also the problem that at LOCAL_IN
SNAT is a valid operation; I am not sure how it affects
early-demux.

 What do you think of the following:
 
 commit f04c42f8041cc4ccc4cb2a30c1058136dd497a83
 Author: Alex Gartrell agartr...@fb.com
 Date:   Wed Jul 1 13:24:46 2015 -0700
 
 ipvs: orphan_skb in case of forwarding

skb_orphan or orphan skb

 It is possible that we bind against a local socket in early_demux when we
 are actually going to want to forward it.  In this case, the socket serves
 no purpose and only serves to confuse things (particularly functions which
 implicitly expect sk_fullsock to be true, like ip_local_out).
 Additionally, skb_set_owner_w is totally broken for non full-socks.
 
 Signed-off-by: Alex Gartrell agartr...@fb.com
 
 diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
 index bf66a86..3efe719 100644
 --- a/net/netfilter/ipvs/ip_vs_xmit.c
 +++ b/net/netfilter/ipvs/ip_vs_xmit.c
 @@ -527,6 +527,19 @@ static inline int
 ip_vs_tunnel_xmit_prepare(struct sk_buff *skb,
 return ret;
  }
 
 +/* In the event of a remote destination, it's possible that we would have
 + * matches against an old socket (particularly a TIME-WAIT socket). This
 + * causes havoc down the line (ip_local_out et. al. expect regular sockets
 + * and invalid memory accesses will happen) so simply drop the association
 + * in this case
 +*/
 +static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb) {

Move '{' on next line and below comment should be closed
on next line. But I guess you will run later
scripts/checkpatch.pl --strict /tmp/file.patch

 +	/* If dev is set, the packet came from the LOCAL_IN callback and
 +	 * not from a local TCP socket */
 +	if (skb->dev)
 +		skb_orphan(skb);
 +}
 +
  /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */
  static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
  struct ip_vs_conn *cp, int local)
 @@ -539,6 +552,7 @@ static inline int ip_vs_nat_send_or_cont(int pf,
 struct sk_buff *skb,
 else
 ip_vs_update_conntrack(skb, cp, 1);
 if (!local) {
 +   ip_vs_drop_early_demux_sk(skb);
 skb_forward_csum(skb);
 NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
 NULL, skb_dst(skb)->dev, dst_output_sk);

For the local=true case in ip_vs_nat_send_or_cont, maybe
we should call skb_orphan when cp->dport != cp->vport or
cp->daddr != cp->vaddr. This is the case where we DNAT to
a local real server but on a different addr/port. If early
demux finds a socket, it is some socket shadowed after adding
the virtual service. So, maybe we have to add such checks
near the NF_ACCEPT code.

Can this work?

	else {
		/* Drop early-demux socket on DNAT */
		if (cp->vport != cp->dport ||
		    !ip_vs_addr_equal(cp->af, &cp->vaddr, &cp->daddr))
			ip_vs_drop_early_demux_sk(skb);
		ret = NF_ACCEPT;
	}

Otherwise, the other changes look good to me.

Regards

--
Julian Anastasov j...@ssi.bg


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-03 Thread Alex Gartrell
Hey

On Fri, Jul 3, 2015 at 1:32 AM, Julian Anastasov j...@ssi.bg wrote:

 To summarize:
 - we should call skb_orphan as soon as possible after
 deciding if packets goes to local or remote real server
 but only for skb->sk set by early_demux, not for packets
 sent by TCP

Yeah, agree

 - if packets go to local server IPVS should not touch
 skb->dst, skb->sk, etc (NF_ACCEPT case)

Yeah, the thing is that early demux could totally match for a socket
that existed before we created the service, and in that instance it
might make the most sense to retain the connection and simply
NF_ACCEPT.  The problem with that approach though is that the
behavior changes if early_demux is not enabled.  I believe that we
should just do the consistent thing and always drop the early_demux
result if bound for non-local, as you've said.

The interesting thing though is that, for the purposes of routing,
enabling early_demux does change the behavior.  I suspect that's a
bug, but it's far enough away from actual use cases that it's probably
fine (who is out there tearing down addresses and setting up routes in
their place?)

 - for skb->sk set by early_demux, skb_orphan should happen before
 skb_set_owner_w in ip_vs_prepare_tunneled_skb because
 skb_set_owner_w will try to increase sk_wmem_alloc which is
 wrong for early_demux phase

Yeah that's my thinking as well.

 - reaching skb_set_owner_w code for skb->sk set by early_demux
 looks wrong to me, it can happen on:
 - redirect (DNAT), if somehow we have socket too
 - IPVS redirect: if we forward both to local and remote
 real servers
 - not likely for forward, nobody forwards traffic
 destined to local IP to remote host


What do you think of the following:

commit f04c42f8041cc4ccc4cb2a30c1058136dd497a83
Author: Alex Gartrell agartr...@fb.com
Date:   Wed Jul 1 13:24:46 2015 -0700

ipvs: orphan_skb in case of forwarding

It is possible that we bind against a local socket in early_demux when we
are actually going to want to forward it.  In this case, the socket serves
no purpose and only serves to confuse things (particularly functions which
implicitly expect sk_fullsock to be true, like ip_local_out).
Additionally, skb_set_owner_w is totally broken for non full-socks.

Signed-off-by: Alex Gartrell agartr...@fb.com

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index bf66a86..3efe719 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -527,6 +527,19 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb,
 	return ret;
 }
 
+/* In the event of a remote destination, it's possible that we would have
+ * matches against an old socket (particularly a TIME-WAIT socket). This
+ * causes havoc down the line (ip_local_out et. al. expect regular sockets
+ * and invalid memory accesses will happen) so simply drop the association
+ * in this case
+ */
+static inline void ip_vs_drop_early_demux_sk(struct sk_buff *skb) {
+	/* If dev is set, the packet came from the LOCAL_IN callback and
+	 * not from a local TCP socket */
+	if (skb->dev)
+		skb_orphan(skb);
+}
+
 /* return NF_STOLEN (sent) or NF_ACCEPT if local=1 (not sent) */
 static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 					 struct ip_vs_conn *cp, int local)
@@ -539,6 +552,7 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 	else
 		ip_vs_update_conntrack(skb, cp, 1);
 	if (!local) {
+		ip_vs_drop_early_demux_sk(skb);
 		skb_forward_csum(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output_sk);
@@ -557,6 +571,7 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb,
 	if (likely(!(cp->flags & IP_VS_CONN_F_NFCT)))
 		ip_vs_notrack(skb);
 	if (!local) {
+		ip_vs_drop_early_demux_sk(skb);
 		skb_forward_csum(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output_sk);
@@ -845,6 +860,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
 	struct ipv6hdr *old_ipv6h = NULL;
 #endif
 
+	ip_vs_drop_early_demux_sk(skb);
+
 	if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
 		new_skb = skb_realloc_headroom(skb, max_headroom);
 		if (!new_skb)


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-03 Thread Julian Anastasov

Hello,

On Thu, 2 Jul 2015, Alex Gartrell wrote:

 On Thu, Jul 2, 2015 at 2:18 PM, Alex Gartrell alexgartr...@gmail.com wrote:
  If early demux was enabled, we'd use the route from the socket
 
 Actually now that I think about it, this is probably broken, because
 we don't reply to the packet but instead silently drop it.

I think, the problem is that input packet takes
the output path, not a problem with its sk_state.

Here is how I understand the situation:

Input packet:

ip_rcv_finish:
    - early_demux:
        - attach skb->sk, skb->destructor, skb->dst
        - skb size is accounted later to sk

LOCAL_IN:
    - IPVS Remote Client mode: ip_vs_in
        - Case 1: IPVS forward to local real server (NF_ACCEPT case):
            - we are going to hit local server, so we should keep
              skb->sk, etc
            - skb->dst not changed
            - return NF_ACCEPT
            - continue to TCP stack
        - Case 2: IPVS forward TUN for example:
            - skb_orphan
            - we should not work with any skb->sk from input path
            - attach new skb->dst for the new real server
            - ip_local_out

TCP:
    - if IPVS does not grab the packet above, we have
      a call to skb_set_owner_r to account the skb size to sk


Locally generated TCP packet:

TCP tcp_transmit_skb:
    - attach skb->sk, skb->destructor, increase sk_wmem_alloc

LOCAL_OUT:
    - IPVS Local Client mode: again ip_vs_in
        - skb->dev is NULL means locally generated packet, skb->sk can
          be set by TCP
        - Case 1: IPVS forward to local real server (NF_ACCEPT case):
            - we are going to hit local server, so we should keep
              skb->sk, etc
            - skb->dst not changed
            - return NF_ACCEPT
            - continue to TCP stack in LOCAL_IN
        - Case 2: IPVS forward TUN for example:
            - no skb_orphan because skb->sk is for
              output path on skb->dev == NULL
            - realloc headroom (call skb_set_owner_w): should happen
              only for skb->sk from output path. This code in
              ip_vs_prepare_tunneled_skb comes from old days from
              ipip.c where skb->sk is present for locally generated
              packets and IPIP's xmit routine reallocs headroom
            - attach new skb->dst for the new real server
            - ip_local_out

To summarize:
- we should call skb_orphan as soon as possible after
  deciding if packets go to a local or remote real server,
  but only for skb->sk set by early_demux, not for packets
  sent by TCP
- if packets go to a local server, IPVS should not touch
  skb->dst, skb->sk, etc (NF_ACCEPT case)
- for skb->sk set by early_demux, skb_orphan should happen before
  skb_set_owner_w in ip_vs_prepare_tunneled_skb because
  skb_set_owner_w will try to increase sk_wmem_alloc, which is
  wrong for the early_demux phase
- reaching skb_set_owner_w code for skb->sk set by early_demux
  looks wrong to me; it can happen on:
    - redirect (DNAT), if somehow we have a socket too
    - IPVS redirect: if we forward both to local and remote
      real servers
    - not likely for forward, nobody forwards traffic
      destined to local IP to remote host
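
The rule summarized above (orphan only an early-demux skb->sk, and only when the packet leaves for a remote real server) can be sketched as a userspace model. Everything here is an illustration under assumptions: the structs are cut down to three fields, and `sock_edemux_model`, `skb_orphan_model` and `ipvs_xmit_prepare_model` are hypothetical names, not IPVS functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct net_device { int ifindex; };
struct sock { int refcnt; };

struct sk_buff {
	struct net_device *dev;   /* non-NULL: packet arrived on the input path */
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
};

/* Stand-in for sock_edemux(): drop the reference taken by early demux. */
static void sock_edemux_model(struct sk_buff *skb)
{
	skb->sk->refcnt--;
}

/* Same shape as skb_orphan(): run the destructor, then clear ownership. */
static void skb_orphan_model(struct sk_buff *skb)
{
	if (skb->destructor != NULL) {
		skb->destructor(skb);
		skb->destructor = NULL;
		skb->sk = NULL;
	} else {
		assert(skb->sk == NULL);
	}
}

/* Returns true when the early-demux association was dropped. */
static bool ipvs_xmit_prepare_model(struct sk_buff *skb, bool local)
{
	assert(skb != NULL);

	if (local)
		return false;   /* NF_ACCEPT case: leave skb->sk and skb->dst alone */
	if (skb->dev == NULL)
		return false;   /* locally generated: skb->sk belongs to TCP output */
	if (skb->sk == NULL)
		return false;   /* nothing was attached by early demux */

	skb_orphan_model(skb);
	return true;
}
```

Keying the decision on `skb->dev` keeps the output-path accounting (sk_wmem_alloc) untouched for locally generated packets, which is the distinction the summary draws.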

Regards

--
Julian Anastasov j...@ssi.bg


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-02 Thread Julian Anastasov

Hello,

On Wed, 1 Jul 2015, Alex Gartrell wrote:

 On Wed, Jul 1, 2015 at 4:26 PM, Eric Dumazet eduma...@google.com wrote:
  I think you are mistaken Alex.
 
 Indeed, I was!  Should be unsurprising.
 
 
  socket early demux cannot possibly set skb->destructor to sock_rfree()
 
 Yeah I will admit adding the code to sock_rfree reflexively out of paranoia.
 
  If skb->destructor is set by early demux, it correctly points to
  sock_edemux()
 
  And this one correctly handles all socket variants.
 
 Yes, the problem appears to be in ip_vs_prepare_tunneled_skb
 (ip_vs_xmit.c:859 in 4.0)
 
 	if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
 		new_skb = skb_realloc_headroom(skb, max_headroom);
 		if (!new_skb)
 			goto error;
 		if (skb->sk)
 			skb_set_owner_w(new_skb, skb->sk);
 		consume_skb(skb);
 		skb = new_skb;
 	}
 
 skb_set_owner_w sets sock_wfree.
 
 I'll figure out how to ensure that we're using an appropriate destructor here.

Alex, in our discussion in January I thought
we can skip calling skb_orphan for some cases but as
input and output path use different skb->destructor
we should call skb_orphan for every method, in every
case when skb->dev != NULL, even when we do not call
LOCAL_OUT, i.e. when NF_ACCEPT is returned for traffic
to local real server. We should not call it only for
local socket (skb->dev == NULL).

I think, your patch from January is almost
good:

http://archive.linuxvirtualserver.org/html/lvs-devel/2015-01/msg00014.html

Just add skb->dev check and we should be fine.
And the patch from Eric for IPVS looks good too.

Regards

--
Julian Anastasov j...@ssi.bg


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-02 Thread Eric Dumazet
On Thu, 2015-07-02 at 14:18 -0700, Alex Gartrell wrote:
 On Thu, Jul 2, 2015 at 1:44 AM, Julian Anastasov j...@ssi.bg wrote:
  I think, your patch from January is almost
  good:
 
 I'll rebase it, add your other suggestions, test it, and send it in.
 
  And the patch from Eric for IPVS looks good too.
 
 Are we sure that we want to change the semantics of set_owner_w to
 orphan it?  It works for us but that's not the behavior I'd expect
 from that function and might burn someone later?

I do not understand the concern.

skb_set_owner_w() callers are attempting to :

1) Remove association of a previous socket (skb_orphan()), if it was
there (while most skb at this point are not associated with a previous
socket)

2) Attach skb to a socket.

My fix makes sure this new socket is not a timewait or request sock.

This could happen when routes are changed in a malicious way,
because in early demux, socket dst cache is not valid anymore,
but we keep skb->sk set.

(This could happen without ipvs being in the picture I think)

The bug could happen, for example, if:
A) GRO cooks a GRO packet
B) we find a timewait socket and attach it to skb (and soon we also
might find a syn_recv socket)
C) Route decides to forward packet
D) output interface needs to add some headroom; check, for example,
net/ipv6/ip6_gre.c around line 699
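
The effect of the proposed guard in skb_set_owner_w() can be illustrated with a small userspace model. This is a sketch under assumptions, not the kernel implementation: the structs are reduced to the fields needed, the `*_model` names are hypothetical, the state numbers are copied from my reading of include/net/tcp_states.h, and plain integers replace the atomic counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State numbers as I understand include/net/tcp_states.h; treat as assumed. */
enum { TCP_ESTABLISHED = 1, TCP_TIME_WAIT = 6, TCP_NEW_SYN_RECV = 12 };
#define TCPF_TIME_WAIT    (1 << TCP_TIME_WAIT)
#define TCPF_NEW_SYN_RECV (1 << TCP_NEW_SYN_RECV)

/* Hypothetical reduced structures; sk_wmem_alloc is a plain int here. */
struct sock { int sk_state; int sk_wmem_alloc; };

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
	int truesize;
};

/* True for full sockets, false for timewait and request mini-sockets. */
static bool sk_fullsock_model(const struct sock *sk)
{
	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
}

/* Stand-in for sock_wfree(): give the write-memory charge back. */
static void sock_wfree_model(struct sk_buff *skb)
{
	skb->sk->sk_wmem_alloc -= skb->truesize;
}

static void skb_orphan_model(struct sk_buff *skb)
{
	if (skb->destructor != NULL) {
		skb->destructor(skb);
		skb->destructor = NULL;
		skb->sk = NULL;
	} else {
		assert(skb->sk == NULL);
	}
}

/* Step 1: drop any previous owner. Step 2: attach, but only to a full socket. */
static void skb_set_owner_w_model(struct sk_buff *skb, struct sock *sk)
{
	assert(skb != NULL && sk != NULL);

	skb_orphan_model(skb);
	if (!sk_fullsock_model(sk))
		return;         /* timewait/request sock: leave the skb unowned */

	skb->sk = sk;
	skb->destructor = sock_wfree_model;
	sk->sk_wmem_alloc += skb->truesize;
}
```

In this model a timewait socket never gets sock_wfree_model installed, so nothing later touches a write-memory counter that such a mini-socket does not have.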







Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-02 Thread Julian Anastasov

Hello,

On Thu, 2 Jul 2015, Julian Anastasov wrote:

   Alex, in our discussion in January I thought
 we can skip calling skb_orphan for some cases but as
 input and output path use different skb->destructor
 we should call skb_orphan for every method, in every
 case when skb->dev != NULL, even when we do not call
 LOCAL_OUT, i.e. when NF_ACCEPT is returned for traffic
 to local real server. We should not call it only for
 local socket (skb->dev == NULL).
 
   I think, your patch from January is almost
 good:
 
 http://archive.linuxvirtualserver.org/html/lvs-devel/2015-01/msg00014.html
 
   Just add skb->dev check and we should be fine.

Sorry, I overlooked a problem. The above is not
correct because we can avoid the skb_orphan call
when 'local' is true. ip_vs_nat_send_or_cont should
call skb_orphan even for local=true, while for TUN
it should happen before ip_vs_prepare_tunneled_skb.
All other methods should avoid skb_orphan if
local=true or skb->dev is NULL.

Regards

--
Julian Anastasov j...@ssi.bg


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-02 Thread Alex Gartrell
On Thu, Jul 2, 2015 at 1:44 AM, Julian Anastasov j...@ssi.bg wrote:
 I think, your patch from January is almost
 good:

I'll rebase it, add your other suggestions, test it, and send it in.

 And the patch from Eric for IPVS looks good too.

Are we sure that we want to change the semantics of set_owner_w to
orphan it?  It works for us but that's not the behavior I'd expect
from that function and might burn someone later?

I've actually been looking through the code more for other uses of
set_owner_w and I noticed this weird quirk:

The test was simple:
0) Enable ip_forward
1) Add an address to loopback and listen on it
2) Accept a connection and close it (creating a TIME-WAIT socket)
3) Add a new route to a gre tunnel

If early demux was enabled, we'd use the route from the socket
If early demux was disabled, we'd forward using the gre tunnel

Should we just replicate this behavior in ipvs?

if (!skb->dev && skb->sk) return NF_ACCEPT;

-- 
Alex Gartrell agartr...@fb.com


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-02 Thread Alex Gartrell
On Thu, Jul 2, 2015 at 2:18 PM, Alex Gartrell alexgartr...@gmail.com wrote:
 If early demux was enabled, we'd use the route from the socket

Actually now that I think about it, this is probably broken, because
we don't reply to the packet but instead silently drop it.

-- 
Alex Gartrell agartr...@fb.com


[PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread Alex Gartrell
If we early-demux bind a TCP_TIMEWAIT socket to an skb and then orphan it
(as we need to do in the ipvs forwarding case), sock_wfree and sock_rfree
are going to reach into the inet_timewait_sock and mess with fields that
don't exist.

Signed-off-by: Alex Gartrell agartr...@fb.com
---
 net/core/sock.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index 1e1fe9a..b37328f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1620,6 +1620,9 @@ void sock_wfree(struct sk_buff *skb)
 	struct sock *sk = skb->sk;
 	unsigned int len = skb->truesize;
 
+	if (sk->sk_state == TCP_TIME_WAIT)
+		return;
+
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
 		/*
 		 * Keep a reference on sk_wmem_alloc, this will be released
@@ -1665,6 +1668,9 @@ void sock_rfree(struct sk_buff *skb)
 	struct sock *sk = skb->sk;
 	unsigned int len = skb->truesize;
 
+	if (sk->sk_state == TCP_TIME_WAIT)
+		return;
+
 	atomic_sub(len, &sk->sk_rmem_alloc);
 	sk_mem_uncharge(sk, len);
 }
-- 
Alex Gartrell agartr...@fb.com



Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread David Miller
From: Alex Gartrell agartr...@fb.com
Date: Wed, 1 Jul 2015 13:13:09 -0700

 If we early-demux bind a TCP_TIMEWAIT socket to an skb and then orphan it
 (as we need to do in the ipvs forwarding case), sock_wfree and sock_rfree
 are going to reach into the inet_timewait_sock and mess with fields that
 don't exist.
 
 Signed-off-by: Alex Gartrell agartr...@fb.com

If we're forwarding, we should not find a local socket, period.

We should only match sockets for locally destined packets.

So I'd say that the state in which you say this can occur is illegal.


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread Eric Dumazet
On Wed, Jul 1, 2015 at 11:14 PM, David Miller da...@davemloft.net wrote:
 From: Alex Gartrell agartr...@fb.com
 Date: Wed, 1 Jul 2015 13:13:09 -0700

 If we early-demux bind a TCP_TIMEWAIT socket to an skb and then orphan it
 (as we need to do in the ipvs forwarding case), sock_wfree and sock_rfree
 are going to reach into the inet_timewait_sock and mess with fields that
 don't exist.

 Signed-off-by: Alex Gartrell agartr...@fb.com

 If we're forwarding, we should not find a local socket, period.

 We should only match sockets for locally destined packets.

 So I'd say that the state in which you say this can occur is illegal.

Right, this patch is totally buggy.

A socket cannot change state to TCP_TIMEWAIT.

A new object is allocated and old one is removed from ehash, then
freed (rcu rules being applied)

Also sock_wfree() has nothing to do with early demux. It is for output
path skbs only.


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread Eric Dumazet
On Thu, 2015-07-02 at 01:26 +0200, Eric Dumazet wrote:
 On Thu, Jul 2, 2015 at 1:18 AM, Alex Gartrell alexgartr...@gmail.com wrote:
  On Wednesday, July 1, 2015, Eric Dumazet eduma...@google.com wrote:
 
  On Wed, Jul 1, 2015 at 11:14 PM, David Miller da...@davemloft.net wrote:
   From: Alex Gartrell agartr...@fb.com
   Date: Wed, 1 Jul 2015 13:13:09 -0700
  
   If we early-demux bind a TCP_TIMEWAIT socket to an skb and then orphan
   it
   (as we need to do in the ipvs forwarding case), sock_wfree and
   sock_rfree
   are going to reach into the inet_timewait_sock and mess with fields
   that
   don't exist.
  
   Signed-off-by: Alex Gartrell agartr...@fb.com
  
   If we're forwarding, we should not find a local socket, period.
 
  A socket cannot change state to TCP_TIMEWAIT.
 
  A new object is allocated and old one is removed from ehash, then
  freed (rcu rules being applied)
 
  Also sock_wfree() has nothing to do with early demux. It is for output
  path skbs only.
 
 
  Alright I kind of cheated and didn't include full context here. The problem
  is that within ipvs we are getting  packets that have been early demuxed and
  associated with time wait sockets which we then wish to forward immediately
  (ip_vs_xmit.c).  Under normal circumstances it would never be associated
  with any sk at all, but it is because of early demux, so we want to drop the
  relationship by calling skb_orphan.  This invokes the destructor which lands
  us there.
 
  So that is how we reach this illegal treating a twsk like an sk state.
 
  If there is a better way to drop the association than skb_orphan I will use
  it.
 
 I think you are mistaken Alex.
 
 socket early demux cannot possibly set skb->destructor to sock_rfree()
 
 If skb->destructor is set by early demux, it correctly points to sock_edemux()
 
 And this one correctly handles all socket variants.


If ipvs is the problem, could you try instead following patch ?

Shoot in the dark, as you do not give a lot of details :(

diff --git a/include/net/sock.h b/include/net/sock.h
index 05a8c1aea251..f77fe9acc7a4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1932,6 +1932,14 @@ static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
 	}
 }
 
+/* This helper checks if a socket is a full socket,
+ * ie _not_ a timewait or request socket.
+ */
+static inline bool sk_fullsock(const struct sock *sk)
+{
+	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
+}
+
 /*
  * Queue a received datagram if it will fit. Stream and sequenced
  * protocols can't normally use this as they need to fit buffers in
@@ -1944,6 +1952,9 @@ static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
 	skb_orphan(skb);
+	if (unlikely(!sk_fullsock(sk)))
+		return;
+
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
 	skb_set_hash_from_sk(skb, sk);
@@ -2204,14 +2215,6 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb)
 	return NULL;
 }
 
-/* This helper checks if a socket is a full socket,
- * ie _not_ a timewait or request socket.
- */
-static inline bool sk_fullsock(const struct sock *sk)
-{
-	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
-}
-
 void sock_enable_timestamp(struct sock *sk, int flag);
 int sock_get_timestamp(struct sock *, struct timeval __user *);
 int sock_get_timestampns(struct sock *, struct timespec __user *);
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 5d2b806a862e..ff05ec5a9016 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1161,9 +1161,10 @@ ip_vs_out(unsigned int hooknum, struct sk_buff *skb, int af)
 	if (unlikely(skb->sk != NULL && hooknum == NF_INET_LOCAL_OUT &&
 		     af == AF_INET)) {
 		struct sock *sk = skb->sk;
-		struct inet_sock *inet = inet_sk(skb->sk);
 
-		if (inet && sk->sk_family == PF_INET && inet->nodefrag)
+		if (sk_fullsock(sk) &&
+		    sk->sk_family == PF_INET &&
+		    inet_sk(sk)->nodefrag)
 			return NF_ACCEPT;
 	}
 
@@ -1640,9 +1641,10 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb, int af)
 	if (unlikely(skb->sk != NULL && hooknum == NF_INET_LOCAL_OUT &&
 		     af == AF_INET)) {
 		struct sock *sk = skb->sk;
-		struct inet_sock *inet = inet_sk(skb->sk);
 
-		if (inet && sk->sk_family == PF_INET && inet->nodefrag)
+		if (sk_fullsock(sk) &&
+		    sk->sk_family == PF_INET &&
+		    inet_sk(sk)->nodefrag)
 			return NF_ACCEPT;
 	}
 

 



Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread Eric Dumazet
On Thu, Jul 2, 2015 at 1:18 AM, Alex Gartrell alexgartr...@gmail.com wrote:
 On Wednesday, July 1, 2015, Eric Dumazet eduma...@google.com wrote:

 On Wed, Jul 1, 2015 at 11:14 PM, David Miller da...@davemloft.net wrote:
  From: Alex Gartrell agartr...@fb.com
  Date: Wed, 1 Jul 2015 13:13:09 -0700
 
  If we early-demux bind a TCP_TIMEWAIT socket to an skb and then orphan
  it
  (as we need to do in the ipvs forwarding case), sock_wfree and
  sock_rfree
  are going to reach into the inet_timewait_sock and mess with fields
  that
  don't exist.
 
  Signed-off-by: Alex Gartrell agartr...@fb.com
 
  If we're forwarding, we should not find a local socket, period.

 A socket cannot change state to TCP_TIMEWAIT.

 A new object is allocated and old one is removed from ehash, then
 freed (rcu rules being applied)

 Also sock_wfree() has nothing to do with early demux. It is for output
 path skbs only.


 Alright I kind of cheated and didn't include full context here. The problem
 is that within ipvs we are getting  packets that have been early demuxed and
 associated with time wait sockets which we then wish to forward immediately
 (ip_vs_xmit.c).  Under normal circumstances it would never be associated
 with any sk at all, but it is because of early demux, so we want to drop the
 relationship by calling skb_orphan.  This invokes the destructor which lands
 us there.

 So that is how we reach this illegal treating a twsk like an sk state.

 If there is a better way to drop the association than skb_orphan I will use
 it.

I think you are mistaken Alex.

socket early demux cannot possibly set skb->destructor to sock_rfree()

If skb->destructor is set by early demux, it correctly points to sock_edemux()

And this one correctly handles all socket variants.

/* All sockets share common refcount, but have different destructors */
void sock_gen_put(struct sock *sk)
{
	if (!atomic_dec_and_test(&sk->sk_refcnt))
		return;

	if (sk->sk_state == TCP_TIME_WAIT)
		inet_twsk_free(inet_twsk(sk));
	else if (sk->sk_state == TCP_NEW_SYN_RECV)
		reqsk_free(inet_reqsk(sk));
	else
		sk_free(sk);
}
EXPORT_SYMBOL_GPL(sock_gen_put);

void sock_edemux(struct sk_buff *skb)
{
	sock_gen_put(skb->sk);
}
EXPORT_SYMBOL(sock_edemux);
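
For readers without a kernel tree at hand, the dispatch above can be mimicked in userspace. This is only a sketch under assumptions: `struct sock_model`, `enum freed_by` and `sock_gen_put_model` are hypothetical names, a plain int replaces the atomic refcount, the state numbers are copied from my reading of include/net/tcp_states.h, and "freeing" is recorded in a field instead of releasing memory.

```c
#include <assert.h>
#include <stddef.h>

/* State numbers as I understand include/net/tcp_states.h; treat as assumed. */
enum { TCP_ESTABLISHED = 1, TCP_TIME_WAIT = 6, TCP_NEW_SYN_RECV = 12 };

/* Records which release path ran, in place of actually freeing memory. */
enum freed_by { FREED_NONE, FREED_TWSK, FREED_REQSK, FREED_SK };

struct sock_model {
	int refcnt;            /* shared by all socket variants */
	int state;             /* selects the variant-specific release path */
	enum freed_by freed;
};

/* Drop one reference; on the last one, pick the release path by state. */
static void sock_gen_put_model(struct sock_model *sk)
{
	assert(sk != NULL && sk->refcnt > 0);

	if (--sk->refcnt != 0)
		return;

	if (sk->state == TCP_TIME_WAIT)
		sk->freed = FREED_TWSK;      /* inet_twsk_free() in the kernel */
	else if (sk->state == TCP_NEW_SYN_RECV)
		sk->freed = FREED_REQSK;     /* reqsk_free() in the kernel */
	else
		sk->freed = FREED_SK;        /* sk_free() in the kernel */
}
```

The point of the model is that the early-demux destructor only needs the common refcount and state, so it is safe for timewait and request sockets, unlike sock_wfree() and sock_rfree().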


Re: [PATCH net-next] net: bail on sock_wfree, sock_rfree when we have a TCP_TIMEWAIT sk

2015-07-01 Thread Alex Gartrell
On Wed, Jul 1, 2015 at 4:26 PM, Eric Dumazet eduma...@google.com wrote:
 I think you are mistaken Alex.

Indeed, I was!  Should be unsurprising.


 socket early demux cannot possibly set skb->destructor to sock_rfree()

Yeah I will admit adding the code to sock_rfree reflexively out of paranoia.

 If skb->destructor is set by early demux, it correctly points to sock_edemux()

 And this one correctly handles all socket variants.

Yes, the problem appears to be in ip_vs_prepare_tunneled_skb
(ip_vs_xmit.c:859 in 4.0)

	if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
		new_skb = skb_realloc_headroom(skb, max_headroom);
		if (!new_skb)
			goto error;
		if (skb->sk)
			skb_set_owner_w(new_skb, skb->sk);
		consume_skb(skb);
		skb = new_skb;
	}

skb_set_owner_w sets sock_wfree.

I'll figure out how to ensure that we're using an appropriate destructor here.

Appreciate the patience!

-- 
Alex Gartrell agartr...@fb.com