Re: net/arp: ARP cache aging failed.

2016-12-14 Thread YueHaibing

On 2016/12/15 4:15, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Wed, 14 Dec 2016, YueHaibing wrote:
> 
>> On 2016/11/26 4:40, Julian Anastasov wrote:
>>>
>>> So, the idea is to move TCP and other similar
>>> users to the new dst_confirm_sk() method. If other
>>> dst_confirm() users are left, they should be checked
>>> if dsts with rt_gateway = 0 can be wrongly used.
>>
>> Sorry for so late.
> 
>   In fact, I'm late too because I almost finished
> my changes, the only remaining part is the cxgb files...
> 
>> Based on your ideas, I make a patch and test it for a while.
> 
>   The problem is that it is valid only for TCP.
> Also, this flag should be reset sometimes, eg. when sk dst
> changes...
> 
>> It works for me.
>>
>> @@ -847,7 +847,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct 
>> msghdr *msg, size_t len)
>>  return err;
>>
>>  do_confirm:
>> -dst_confirm(>dst);
>> +dst_confirm_sk(sk);
> 
>   MSG_CONFIRM from sendmsg needs special treatment. The
> problem is that UDP sending does not lock the socket, so I also
> added a skb flag to handle this situation in ip*_append_data.
> We do not want threaded application firing at different
> destinations to confirm the wrong neighbour. MSG_PROBE is
> another issue, the XFRM dst chaining, etc...
> 
>   I hope, I'll be ready this weekend with few patches
> that change all dst_confirm users... There I'll explain
> everything in detail.
> 
> Regards
> 
> --
> Julian Anastasov 
> 
> .
> 

Great, I'll be glad to do a testing.



Re: net/arp: ARP cache aging failed.

2016-12-14 Thread Julian Anastasov

Hello,

On Wed, 14 Dec 2016, YueHaibing wrote:

> On 2016/11/26 4:40, Julian Anastasov wrote:
> > 
> > So, the idea is to move TCP and other similar
> > users to the new dst_confirm_sk() method. If other
> > dst_confirm() users are left, they should be checked
> > if dsts with rt_gateway = 0 can be wrongly used.
> 
> Sorry for so late.

In fact, I'm late too because I almost finished
my changes, the only remaining part is the cxgb files...

> Based on your ideas, I make a patch and test it for a while.

The problem is that it is valid only for TCP.
Also, this flag should be reset sometimes, eg. when sk dst
changes...

> It works for me.
> 
> @@ -847,7 +847,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr 
> *msg, size_t len)
>   return err;
> 
>  do_confirm:
> - dst_confirm(>dst);
> + dst_confirm_sk(sk);

MSG_CONFIRM from sendmsg needs special treatment. The
problem is that UDP sending does not lock the socket, so I also
added a skb flag to handle this situation in ip*_append_data.
We do not want threaded application firing at different
destinations to confirm the wrong neighbour. MSG_PROBE is
another issue, the XFRM dst chaining, etc...

I hope, I'll be ready this weekend with few patches
that change all dst_confirm users... There I'll explain
everything in detail.

Regards

--
Julian Anastasov 


Re: net/arp: ARP cache aging failed.

2016-12-14 Thread YueHaibing
On 2016/11/26 4:40, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Fri, 25 Nov 2016, Hannes Frederic Sowa wrote:
> 
>> On 25.11.2016 09:18, Julian Anastasov wrote:
>>>
>>> Another option would be to add similar bit to
>>> struct sock (sk_dst_pending_confirm), may be ip_finish_output2
>>> can propagate it to dst_neigh_output via new arg and then to
>>> reset it.
>>
>> I don't understand how this can help? Maybe I understood it wrong?
> 
>   The idea is, the indication from highest level (TCP) to
> be propagated to lowest level (neigh).
> 
>>> Or to change n->confirmed via some new inline sock
>>> function instead of changing dst_neigh_output. At this place 
>>> skb->sk is optional, may be we need (skb->sk && dst ==
>>> skb->sk->sk_dst_cache) check. Also, when sk_dst_cache changes
>>> we should clear this flag.
>>
>> In "(skb->sk && dst == skb->sk->sk_dst_cache)" where does dst come from?
> 
>   This is in case we do not want to propagate indication
> from TCP to tunnels, see below...
> 
>> I don't see a possibility besides using mac_header or inner_mac_header
>> to look up the incoming MAC address and confirm that one in the neighbor
>> cache so far (we could try to optimize this case for rt_gateway though).
> 
>   My idea is as follows:
> 
> - TCP instead of calling dst_confirm should call new func
> dst_confirm_sk(sk) where sk->sk_dst_pending_confirm is set,
> not dst->pending_confirm.
> 
> - ip_finish_output2: use skb->sk (TCP) to check for
> sk_dst_pending_confirm and update n->confirmed in some
> new inline func
> 
>   Correct me if I'm wrong but here is how I see the
> situation:
> 
> - TCP caches output dst in socket, for example, dst1,
> sets it as skb->dst
> 
> - if xfrm tunnels are used then dst1 can point to some tunnel,
> i.e. it is not in all cases the same skb->dst, as seen by 
> ip_finish_output2
> 
> - netfilter can DNAT and assign different skb->dst
> 
> - as result, before now, dst1->pending_confirm is set but
> it is not used properly because ip_finish_output2 uses
> the final skb->dst which can be different or with
> rt_gateway = 0
> 
> - considering that tunnels do not use dst_confirm,
> n->confirmed is not changed every time. There are some
> interesting cases:
> 
> 1. both dst1 and the final skb->dst point to same
> dst with rt_gateway = 0: n->confirmed is updated but
> as wee see it can be for wrong neigh.
> 
> 2. dst1 is different from skb->dst, so n->confirmed is
> not changed. This can happen on DNAT or when using
> tunnel to secure gateway.
> 
> - in ip_finish_output2 we have skb->sk (original TCP sk) and
> sk arg (equal to skb->sk or second option is sock of some UDP
> tunnel, etc). The idea is to use skb->sk->sk_dst_pending_confirm,
> i.e. highest level sk because the sk arg, if different, does not
> have such flag set (tunnels do not call dst_confirm).
> 
> - ip_finish_output2 should call new func dst_neigh_confirm_sk
> (which has nothing related to dst, hm... better name is needed):
> 
>   if (!IS_ERR(neigh)) {
> - int res = dst_neigh_output(dst, neigh, skb);
> + int res;
> +
> + dst_neigh_confirm_sk(skb->sk, neigh);
> + res = dst_neigh_output(dst, neigh, skb);
> 
>   which should do something like this:
> 
>   if (sk && sk->sk_dst_pending_confirm) {
>   unsigned long now = jiffies;
> 
>   sk->sk_dst_pending_confirm = 0;
>   /* avoid dirtying neighbour */
>   if (n->confirmed != now)
>   n->confirmed = now;
>   }
> 
>   Additional dst == skb->sk->sk_dst_cache check will
> not propagate the indication on DNAT/tunnel. For me,
> it is better without such check.
> 
>   So, the idea is to move TCP and other similar
> users to the new dst_confirm_sk() method. If other
> dst_confirm() users are left, they should be checked
> if dsts with rt_gateway = 0 can be wrongly used.
> 
> Regards
> 
> --
> Julian Anastasov 
> 
> .
> 

Sorry for so late.

Based on your ideas, I make a patch and test it for a while.

It works for me.

>From dc033fe4bac8931590e15774a298f04ccea8ed4c Mon Sep 17 00:00:00 2001
From: YueHabing 
Date: Wed, 14 Dec 2016 02:43:51 -0500
Subject: [PATCH] arp: do neigh confirm based on sk arg

Signed-off-by: YueHabing 
---
 include/net/sock.h | 18 ++
 net/ipv4/ip_output.c   |  5 -
 net/ipv4/ping.c|  2 +-
 net/ipv4/raw.c |  2 +-
 net/ipv4/tcp_input.c   |  8 ++--
 net/ipv4/tcp_metrics.c |  5 +++--
 net/ipv4/udp.c |  2 +-
 net/ipv6/raw.c |  2 +-
 net/ipv6/route.c   |  6 +++---
 net/ipv6/udp.c |  2 +-
 10 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 92b2697..ece8dfa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -434,6 +434,7 @@ struct sock {
struct sk_buff  *sk_send_head;
__s32 

Re: net/arp: ARP cache aging failed.

2016-11-25 Thread Julian Anastasov

Hello,

On Fri, 25 Nov 2016, Hannes Frederic Sowa wrote:

> On 25.11.2016 09:18, Julian Anastasov wrote:
> > 
> > Another option would be to add similar bit to
> > struct sock (sk_dst_pending_confirm), may be ip_finish_output2
> > can propagate it to dst_neigh_output via new arg and then to
> > reset it.
> 
> I don't understand how this can help? Maybe I understood it wrong?

The idea is, the indication from highest level (TCP) to
be propagated to lowest level (neigh).

> > Or to change n->confirmed via some new inline sock
> > function instead of changing dst_neigh_output. At this place 
> > skb->sk is optional, may be we need (skb->sk && dst ==
> > skb->sk->sk_dst_cache) check. Also, when sk_dst_cache changes
> > we should clear this flag.
> 
> In "(skb->sk && dst == skb->sk->sk_dst_cache)" where does dst come from?

This is in case we do not want to propagate indication
from TCP to tunnels, see below...

> I don't see a possibility besides using mac_header or inner_mac_header
> to look up the incoming MAC address and confirm that one in the neighbor
> cache so far (we could try to optimize this case for rt_gateway though).

My idea is as follows:

- TCP instead of calling dst_confirm should call new func
dst_confirm_sk(sk) where sk->sk_dst_pending_confirm is set,
not dst->pending_confirm.

- ip_finish_output2: use skb->sk (TCP) to check for
sk_dst_pending_confirm and update n->confirmed in some
new inline func

Correct me if I'm wrong but here is how I see the
situation:

- TCP caches output dst in socket, for example, dst1,
sets it as skb->dst

- if xfrm tunnels are used then dst1 can point to some tunnel,
i.e. it is not in all cases the same skb->dst, as seen by 
ip_finish_output2

- netfilter can DNAT and assign different skb->dst

- as result, before now, dst1->pending_confirm is set but
it is not used properly because ip_finish_output2 uses
the final skb->dst which can be different or with
rt_gateway = 0

- considering that tunnels do not use dst_confirm,
n->confirmed is not changed every time. There are some
interesting cases:

1. both dst1 and the final skb->dst point to same
dst with rt_gateway = 0: n->confirmed is updated but
as wee see it can be for wrong neigh.

2. dst1 is different from skb->dst, so n->confirmed is
not changed. This can happen on DNAT or when using
tunnel to secure gateway.

- in ip_finish_output2 we have skb->sk (original TCP sk) and
sk arg (equal to skb->sk or second option is sock of some UDP
tunnel, etc). The idea is to use skb->sk->sk_dst_pending_confirm,
i.e. highest level sk because the sk arg, if different, does not
have such flag set (tunnels do not call dst_confirm).

- ip_finish_output2 should call new func dst_neigh_confirm_sk
(which has nothing related to dst, hm... better name is needed):

if (!IS_ERR(neigh)) {
-   int res = dst_neigh_output(dst, neigh, skb);
+   int res;
+
+   dst_neigh_confirm_sk(skb->sk, neigh);
+   res = dst_neigh_output(dst, neigh, skb);

which should do something like this:

if (sk && sk->sk_dst_pending_confirm) {
unsigned long now = jiffies;

sk->sk_dst_pending_confirm = 0;
/* avoid dirtying neighbour */
if (n->confirmed != now)
n->confirmed = now;
}

Additional dst == skb->sk->sk_dst_cache check will
not propagate the indication on DNAT/tunnel. For me,
it is better without such check.

So, the idea is to move TCP and other similar
users to the new dst_confirm_sk() method. If other
dst_confirm() users are left, they should be checked
if dsts with rt_gateway = 0 can be wrongly used.

Regards

--
Julian Anastasov 


Re: net/arp: ARP cache aging failed.

2016-11-25 Thread Hannes Frederic Sowa
On 25.11.2016 09:18, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Thu, 24 Nov 2016, Hannes Frederic Sowa wrote:
> 
>> I think some people are thinking about it already (me also ;) ).
>>
>> But it is not easy to come up with a solution. First of all, we need to
>> look up the L2 address again in the neighbor cache and confirm the
>> appropriate neighbor. Secondly we should only do that for packets which
>> we can actually confirm (that means passing the TCP recv tests or some
>> other kind of confirmation besides simply spamming the box etc). Also it
>> needs to be fast.
> 
>   Another option would be to add similar bit to
> struct sock (sk_dst_pending_confirm), may be ip_finish_output2
> can propagate it to dst_neigh_output via new arg and then to
> reset it.

I don't understand how this can help? Maybe I understood it wrong?

> Or to change n->confirmed via some new inline sock
> function instead of changing dst_neigh_output. At this place 
> skb->sk is optional, may be we need (skb->sk && dst ==
> skb->sk->sk_dst_cache) check. Also, when sk_dst_cache changes
> we should clear this flag.

In "(skb->sk && dst == skb->sk->sk_dst_cache)" where does dst come from?

I don't see a possibility besides using mac_header or inner_mac_header
to look up the incoming MAC address and confirm that one in the neighbor
cache so far (we could try to optimize this case for rt_gateway though).

Bye,
Hannes



Re: net/arp: ARP cache aging failed.

2016-11-25 Thread Julian Anastasov

Hello,

On Thu, 24 Nov 2016, Hannes Frederic Sowa wrote:

> I think some people are thinking about it already (me also ;) ).
> 
> But it is not easy to come up with a solution. First of all, we need to
> look up the L2 address again in the neighbor cache and confirm the
> appropriate neighbor. Secondly we should only do that for packets which
> we can actually confirm (that means passing the TCP recv tests or some
> other kind of confirmation besides simply spamming the box etc). Also it
> needs to be fast.

Another option would be to add similar bit to
struct sock (sk_dst_pending_confirm), may be ip_finish_output2
can propagate it to dst_neigh_output via new arg and then to
reset it. Or to change n->confirmed via some new inline sock
function instead of changing dst_neigh_output. At this place 
skb->sk is optional, may be we need (skb->sk && dst ==
skb->sk->sk_dst_cache) check. Also, when sk_dst_cache changes
we should clear this flag.

Regards

--
Julian Anastasov 


Re: net/arp: ARP cache aging failed.

2016-11-24 Thread Hannes Frederic Sowa
On 24.11.2016 10:06, YueHaibing wrote:
> On 2016/11/24 15:51, Julian Anastasov wrote:
>>
>>  Hello,
>>
>> On Wed, 23 Nov 2016, Eric Dumazet wrote:
>>
>>> On Wed, 2016-11-23 at 15:37 +0100, Hannes Frederic Sowa wrote:
>>>
 Irregardless about the question if bonding should keep the MAC address
 alive, a MAC address can certainly change below a TCP connection.
>>>
>>> Of course ;)
>>>
> 
> I configured bonding fail_over_mac=1 ,so the bonding MAC always be the MAC
> address of the currently active slave.
> 

 dst_entry is 1:n to neigh_entry and as such we can end up confirming an
 aging neighbor while sending a reply with dst->pending_confirm set while
 the confirming packet actually came from a different neighbor.

 I agree with Julian, pending_confirm became useless in this way.
>>>
>>> Let's kill it then ;)
>>
>>  It works for traffic via gateway. I now see that
>> we can even avoid write in dst_confirm:
>>
>>  if (!dst->pending_confirm)
>>  dst->pending_confirm = 1;
>>
>>  because it is called by non-dup TCP ACKs.
>>
>>  But for traffic to hosts on LAN we need different solution,
>> i.e. for cached dsts with rt_gateway = 0 (last entry below).
>>
>> rt_uses_gateway rt_gateway DST_NOCACHE Description
>> 
>> 1   nh_gw  ANY Traffic via gateway
>> 0   LAN_host   1   FLOWI_FLAG_KNOWN_NH (nexthop
>>set by IPVS, hdrincl, xt_TEE)
>> 0   0  0   1 dst for many subnet hosts
>>
>> Regards
>>
>> --
>> Julian Anastasov 
>>
>> .
>>
> 
> As above,Is there a plan to fix the problem ? Should we just not call 
> dst_confirm
> when in the case rt->rt_uses_gateway/DST_NOCACHE?

I think some people are thinking about it already (me also ;) ).

But it is not easy to come up with a solution. First of all, we need to
look up the L2 address again in the neighbor cache and confirm the
appropriate neighbor. Secondly we should only do that for packets which
we can actually confirm (that means passing the TCP recv tests or some
other kind of confirmation besides simply spamming the box etc). Also it
needs to be fast.

Bye,
Hannes



Re: net/arp: ARP cache aging failed.

2016-11-24 Thread YueHaibing
On 2016/11/24 15:51, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Wed, 23 Nov 2016, Eric Dumazet wrote:
> 
>> On Wed, 2016-11-23 at 15:37 +0100, Hannes Frederic Sowa wrote:
>>
>>> Irregardless about the question if bonding should keep the MAC address
>>> alive, a MAC address can certainly change below a TCP connection.
>>
>> Of course ;)
>>

I configured bonding fail_over_mac=1 ,so the bonding MAC always be the MAC
address of the currently active slave.

>>>
>>> dst_entry is 1:n to neigh_entry and as such we can end up confirming an
>>> aging neighbor while sending a reply with dst->pending_confirm set while
>>> the confirming packet actually came from a different neighbor.
>>>
>>> I agree with Julian, pending_confirm became useless in this way.
>>
>> Let's kill it then ;)
> 
>   It works for traffic via gateway. I now see that
> we can even avoid write in dst_confirm:
> 
>   if (!dst->pending_confirm)
>   dst->pending_confirm = 1;
> 
>   because it is called by non-dup TCP ACKs.
> 
>   But for traffic to hosts on LAN we need different solution,
> i.e. for cached dsts with rt_gateway = 0 (last entry below).
> 
> rt_uses_gateway rt_gateway DST_NOCACHE Description
> 
> 1   nh_gw  ANY Traffic via gateway
> 0   LAN_host   1   FLOWI_FLAG_KNOWN_NH (nexthop
>set by IPVS, hdrincl, xt_TEE)
> 0   0  0   1 dst for many subnet hosts
> 
> Regards
> 
> --
> Julian Anastasov 
> 
> .
> 

As above,Is there a plan to fix the problem ? Should we just not call 
dst_confirm
when in the case rt->rt_uses_gateway/DST_NOCACHE?



Re: net/arp: ARP cache aging failed.

2016-11-23 Thread Julian Anastasov

Hello,

On Wed, 23 Nov 2016, Eric Dumazet wrote:

> On Wed, 2016-11-23 at 15:37 +0100, Hannes Frederic Sowa wrote:
> 
> > Irregardless about the question if bonding should keep the MAC address
> > alive, a MAC address can certainly change below a TCP connection.
> 
> Of course ;)
> 
> > 
> > dst_entry is 1:n to neigh_entry and as such we can end up confirming an
> > aging neighbor while sending a reply with dst->pending_confirm set while
> > the confirming packet actually came from a different neighbor.
> > 
> > I agree with Julian, pending_confirm became useless in this way.
> 
> Let's kill it then ;)

It works for traffic via gateway. I now see that
we can even avoid write in dst_confirm:

if (!dst->pending_confirm)
dst->pending_confirm = 1;

because it is called by non-dup TCP ACKs.

But for traffic to hosts on LAN we need different solution,
i.e. for cached dsts with rt_gateway = 0 (last entry below).

rt_uses_gateway rt_gateway DST_NOCACHE Description

1   nh_gw  ANY Traffic via gateway
0   LAN_host   1   FLOWI_FLAG_KNOWN_NH (nexthop
   set by IPVS, hdrincl, xt_TEE)
0   0  0   1 dst for many subnet hosts

Regards

--
Julian Anastasov 


Re: net/arp: ARP cache aging failed.

2016-11-23 Thread Eric Dumazet
On Wed, 2016-11-23 at 15:37 +0100, Hannes Frederic Sowa wrote:

> Irregardless about the question if bonding should keep the MAC address
> alive, a MAC address can certainly change below a TCP connection.

Of course ;)

> 
> dst_entry is 1:n to neigh_entry and as such we can end up confirming an
> aging neighbor while sending a reply with dst->pending_confirm set while
> the confirming packet actually came from a different neighbor.
> 
> I agree with Julian, pending_confirm became useless in this way.

Let's kill it then ;)




Re: net/arp: ARP cache aging failed.

2016-11-23 Thread Hannes Frederic Sowa
On 23.11.2016 13:05, Eric Dumazet wrote:
> On Wed, 2016-11-23 at 10:33 +0200, Julian Anastasov wrote:
>>  Hello,
>>
>> On Wed, 23 Nov 2016, yuehaibing wrote:
>>
>>> As to my topo,HOST1 and HOST3 share one route on HOST2, tcp connection 
>>> between HOST2 and HOST3 may call tcp_ack to set dst->pending_confirm.
>>> 
>>> So dst_neigh_output may wrongly freshed  n->confirmed which stands for 
>>> HOST1,however HOST1'MAC had been changed.
>>>
>>> The possibility of this occurred Significantly increases ,when ping and 
>>> TCP transaction are set the same processor affinity on the HOST2.
>>>
>>> It seems that the issue is brought in commit 
>>> 5110effee8fde2edfacac9cd12a9960ab2dc39ea ("net: Do delayed neigh 
>>> confirmation.").
>>
>>  Bad news. Problem is not in delayed confirmation but
>> in the mechanism to use same dst for different neighbours on
>> LAN. We don't have a dst->neighbour reference anymore.
>>
>>  For IPv4 this is related to rt->rt_uses_gateway but
>> also to DST_NOCACHE. In the other cases we can not call
>> dst_confirm, may be we should lookup the neigh entry instead.
>> But we need a way to reduce such lookups on every packet,
>> for example, by remembering in struct sock and checking if
>> some bits of jiffies (at least 4-5) are changed from
>> previous lookup.
> 
> 
> I thought bonding would keep the MAC address 'alive'.

I wonder about this, too.

> If TCP packets are confirmed, this means the old MAC address is still
> valid, what am I missing here ?

Irregardless about the question if bonding should keep the MAC address
alive, a MAC address can certainly change below a TCP connection.

dst_entry is 1:n to neigh_entry and as such we can end up confirming an
aging neighbor while sending a reply with dst->pending_confirm set while
the confirming packet actually came from a different neighbor.

I agree with Julian, pending_confirm became useless in this way.

Bye,
Hannes



Re: net/arp: ARP cache aging failed.

2016-11-23 Thread Eric Dumazet
On Wed, 2016-11-23 at 10:33 +0200, Julian Anastasov wrote:
>   Hello,
> 
> On Wed, 23 Nov 2016, yuehaibing wrote:
> 
> > As to my topo,HOST1 and HOST3 share one route on HOST2, tcp connection 
> > between HOST2 and HOST3 may call tcp_ack to set dst->pending_confirm.
> > 
> > So dst_neigh_output may wrongly freshed  n->confirmed which stands for 
> > HOST1,however HOST1'MAC had been changed.
> > 
> > The possibility of this occurred Significantly increases ,when ping and 
> > TCP transaction are set the same processor affinity on the HOST2.
> > 
> > It seems that the issue is brought in commit 
> > 5110effee8fde2edfacac9cd12a9960ab2dc39ea ("net: Do delayed neigh 
> > confirmation.").
> 
>   Bad news. Problem is not in delayed confirmation but
> in the mechanism to use same dst for different neighbours on
> LAN. We don't have a dst->neighbour reference anymore.
> 
>   For IPv4 this is related to rt->rt_uses_gateway but
> also to DST_NOCACHE. In the other cases we can not call
> dst_confirm, may be we should lookup the neigh entry instead.
> But we need a way to reduce such lookups on every packet,
> for example, by remembering in struct sock and checking if
> some bits of jiffies (at least 4-5) are changed from
> previous lookup.


I thought bonding would keep the MAC address 'alive'.

If TCP packets are confirmed, this means the old MAC address is still
valid, what am I missing here ?






Re: net/arp: ARP cache aging failed.

2016-11-23 Thread Julian Anastasov

Hello,

On Wed, 23 Nov 2016, yuehaibing wrote:

>   As to my topo,HOST1 and HOST3 share one route on HOST2, tcp connection 
> between HOST2 and HOST3 may call tcp_ack to set dst->pending_confirm.
>   
> So dst_neigh_output may wrongly freshed  n->confirmed which stands for 
> HOST1,however HOST1'MAC had been changed.
> 
>   The possibility of this occurred Significantly increases ,when ping and 
> TCP transaction are set the same processor affinity on the HOST2.
> 
>   It seems that the issue is brought in commit 
> 5110effee8fde2edfacac9cd12a9960ab2dc39ea ("net: Do delayed neigh 
> confirmation.").

Bad news. Problem is not in delayed confirmation but
in the mechanism to use same dst for different neighbours on
LAN. We don't have a dst->neighbour reference anymore.

For IPv4 this is related to rt->rt_uses_gateway but
also to DST_NOCACHE. In the other cases we can not call
dst_confirm, may be we should lookup the neigh entry instead.
But we need a way to reduce such lookups on every packet,
for example, by remembering in struct sock and checking if
some bits of jiffies (at least 4-5) are changed from
previous lookup.

Regards

--
Julian Anastasov 


net/arp: ARP cache aging failed.

2016-11-22 Thread yuehaibing

 Hi,

I've encountered a arp cache aging failed bug in 4.9 kernel.The topo is 
as follow:


HOST1     -
  IP1 | Switch |IP2-| HOST2 |
 |    -
  --Bonging  |
   |   | IP3
 MAC1MAC2  | HOST3 |


HOST1 have a bonding interface which including two NICs

IP1:192.168.1.100/24
IP2:192.168.1.200/24
IP2:192.168.1.300/24

There are large numbers of TCP transaction between HOST2 and HOST3.


The Host2 can ping HOST1 normally.However,It cannot ping after HOST1 
bonding interface deactived a working NIC.


on HOST2 ,use fowllow command:

watch "ip -s neigh show|grep 192.168.1.100"

I noticed the old HOST1 arp cache aging counter is gradually  increased 
,then reset before it reached 30.This process is repeated,
thus the arp cache holding REACHABLE status.The new HOST1 MAC arp cache cannot 
been renewed ,and thus ping cannot sendto the correct HOST1 MAC.

Then I found n->confirmed is freshed  in dst_neigh_output while 
dst->pending_confirm is set to 1.

include/net/dst.h

static inline void dst_confirm(struct dst_entry *dst)
{
dst->pending_confirm = 1;
}

static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
   struct sk_buff *skb)
{
const struct hh_cache *hh;

if (dst->pending_confirm) {
unsigned long now = jiffies;

dst->pending_confirm = 0;
/* avoid dirtying neighbour */
if (n->confirmed != now)
n->confirmed = now;
}
...
}

dst_confirm can be called by tcp_ack in net/ipv4/tcp_input.c.

/* This routine deals with incoming acks, but not outgoing ones. */
static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{
.
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) {
struct dst_entry *dst = __sk_dst_get(sk);
if (dst)
dst_confirm(dst);
}
.
}

As to my topo,HOST1 and HOST3 share one route on HOST2, tcp connection 
between HOST2 and HOST3 may call tcp_ack to set dst->pending_confirm.

So dst_neigh_output may wrongly freshed  n->confirmed which stands for 
HOST1,however HOST1'MAC had been changed.

The possibility of this occurred Significantly increases ,when ping and 
TCP transaction are set the same processor affinity on the HOST2.

It seems that the issue is brought in commit 
5110effee8fde2edfacac9cd12a9960ab2dc39ea ("net: Do delayed neigh 
confirmation.").