Re: [tipc-discussion] [net] tipc: fix false detection of retransmit failures

2019-08-16 Thread David Miller
From: Tuong Lien 
Date: Thu, 15 Aug 2019 10:24:08 +0700

> This commit eliminates the use of the link 'stale_limit' & 'prev_from'
> (besides the already removed - 'stale_cnt') variables in the detection
> of repeated retransmit failures as there is no proper way to initialize
> them to avoid a false detection, i.e. it is not really a retransmission
> failure but due to a garbage values in the variables.
> 
> Instead, a jiffies variable will be added to individual skbs (like the
> way we restrict the skb retransmissions) in order to mark the first skb
> retransmit time. Later on, at the next retransmissions, the timestamp
> will be checked to see if the skb in the link transmq is "too stale",
> that is, the link tolerance time has passed, so that a link reset will
> be ordered. Note, just checking on the first skb in the queue is fine
> enough since it must be the oldest one.
> A counter is also added to keep track the actual skb retransmissions'
> number for later checking when the failure happens.
> 
> The downside of this approach is that the skb->cb[] buffer is about to
> be exhausted, however it is always able to allocate another memory area
> and keep a reference to it when needed.
> 
> Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria")
> Reported-by: Hoang Le 
> Acked-by: Ying Xue 
> Acked-by: Jon Maloy 
> Signed-off-by: Tuong Lien 

Applied, thank you.


___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [net] tipc: fix false detection of retransmit failures

2019-08-02 Thread Ying Xue
Hi Tuong,

Thank you for your clear explanation.

I am fine to this patch. Good work!

Regards,
Ying


On 8/1/19 10:58 AM, Tuong Lien Tong wrote:
> Hi Ying,
> 
> See below my answers inline.
> 
> BR/Tuong
> 
> -Original Message-
> From: Ying Xue  
> Sent: Wednesday, July 31, 2019 8:21 PM
> To: Tuong Lien ; 
> tipc-discussion@lists.sourceforge.net; jon.ma...@ericsson.com; 
> ma...@donjonn.com
> Subject: Re: [net] tipc: fix false detection of retransmit failures
> 
> On 7/19/19 11:56 AM, Tuong Lien wrote:
>> This commit eliminates the use of the link 'stale_limit' & 'prev_from'
>> (besides the already removed - 'stale_cnt') variables in the detection
>> of repeated retransmit failures as there is no proper way to initialize
>> them to avoid a false detection, i.e. it is not really a retransmission
>> failure but due to a garbage values in the variables.
> 
> Sorry, I couldn't understand the reason why we have no proper way to
> initialize 'stale_limit' & 'prev_from' variables of tipc_link struct.
> 
> As far as I understand, the two variables have been set to zero when
> tipc_link object is allocated with kzalloc() in tipc_link_create().
> 
> Can you please help me understand what real scenario we cannot properly
> set them.
> 
> [Tuong]: Yes, these two variables have been initialized to zero at the link 
> create but zero or any other value is not 'safe' for them, the retransmit 
> failure criteria can be met without a real failure. Specifically, the 
> 'time_after()' can return 'true' unexpectedly due to a garbage value (like 
> zero...) of the 'stale_limit' unless it's intentionally set in the 1st 
> condition. However, the 1st condition with the 'prev_from' is not always 
> satisfied. In case the next 'from' is coincidentally identical to the 
> 'prev_from', we will miss it.
> 
>> -if (r->prev_from != from) {
>> -r->prev_from = from;
>> -r->stale_limit = jiffies + msecs_to_jiffies(r->tolerance);
>> -} else if (time_after(jiffies, r->stale_limit)) {
>> -pr_warn("Retransmission failure on link <%s>\n", l->name);
> 
> For example:
> 1) n-th retransmit: (prev_from = x, from = 100)
> ==> ok, we set the variables: prev_from = 100, stale_limit = jiffies + 1.5s
> 2) now, this pkt #100 was retransmitted successfully...
> 3) Later on, n+1-th retransmit: (prev_from = 100, from = 100)
> -> We don’t set the 'stale_limit' but do the “else if” and bump!
> 
> Now, we can try to reset or re-initialize the 'prev_from' somehow, e.g. when 
> the pkt #100 is ack-ed & released, but what value will be for it? Note, any 
> value is a valid sequence number, and if the next 'from' equals that value, 
> we will face with the same trouble again.
> Trying to set the 'stale_limit' to very far in the future is irrelevant too 
> because it turns out that we will disable the feature if the same 'from' is 
> really faced with the repeated retransmit failure!
> 
>>
>> Instead, a jiffies variable will be added to individual skbs (like the
>> way we restrict the skb retransmissions) in order to mark the first skb
>> retransmit time. Later on, at the next retransmissions, the timestamp
>> will be checked to see if the skb in the link transmq is "too stale",
>> that is, the link tolerance time has passed, so that a link reset will
>> be ordered. Note, just checking on the first skb in the queue is fine
>> enough since it must be the oldest one.
>> A counter is also added to keep track the actual skb retransmissions'
>> number for later checking when the failure happens.
>>
>> The downside of this approach is that the skb->cb[] buffer is about to
>> be exhausted, however it is always able to allocate another memory area
>> and keep a reference to it when needed.
>>
>> Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria")
>> Reported-by: Hoang Le 
>> Signed-off-by: Tuong Lien 
>> ---
>>  net/tipc/link.c | 92 
>> -
>>  net/tipc/msg.h  |  8 +++--
>>  2 files changed, 57 insertions(+), 43 deletions(-)
>>
>> diff --git a/net/tipc/link.c b/net/tipc/link.c
>> index 66d3a07bc571..c2c5c53cad22 100644
>> --- a/net/tipc/link.c
>> +++ b/net/tipc/link.c
>> @@ -106,8 +106,6 @@ struct tipc_stats {
>>   * @transmitq: queue for sent, non-acked messages
>>   * @backlogq: queue for messages waiting to be sent
>>   * @snt_nxt: next sequence number to use for outbound messages
>> - * @prev_from: sequence number of most previous retransmission request
>> - * @stale_limit: time when repeated identical retransmits must force link 
>> reset
>>   * @ackers: # of peers that needs to ack each packet before it can be 
>> released
>>   * @acked: # last packet acked by a certain peer. Used for broadcast.
>>   * @rcv_nxt: next sequence number to expect for inbound messages
>> @@ -164,9 +162,7 @@ struct tipc_link {
>>  u16 limit;
>>  } backlog[5];
>>  u16 snd_nxt;
>> -u16 prev_from;
>>  u16 window;
>> -unsigned long 

Re: [tipc-discussion] [net] tipc: fix false detection of retransmit failures

2019-07-31 Thread Tuong Lien Tong
Hi Ying,

See below my answers inline.

BR/Tuong

-Original Message-
From: Ying Xue  
Sent: Wednesday, July 31, 2019 8:21 PM
To: Tuong Lien ; 
tipc-discussion@lists.sourceforge.net; jon.ma...@ericsson.com; ma...@donjonn.com
Subject: Re: [net] tipc: fix false detection of retransmit failures

On 7/19/19 11:56 AM, Tuong Lien wrote:
> This commit eliminates the use of the link 'stale_limit' & 'prev_from'
> (besides the already removed - 'stale_cnt') variables in the detection
> of repeated retransmit failures as there is no proper way to initialize
> them to avoid a false detection, i.e. it is not really a retransmission
> failure but due to a garbage values in the variables.

Sorry, I couldn't understand the reason why we have no proper way to
initialize 'stale_limit' & 'prev_from' variables of tipc_link struct.

As far as I understand, the two variables have been set to zero when
tipc_link object is allocated with kzalloc() in tipc_link_create().

Can you please help me understand what real scenario we cannot properly
set them.

[Tuong]: Yes, these two variables have been initialized to zero at the link 
create but zero or any other value is not 'safe' for them, the retransmit 
failure criteria can be met without a real failure. Specifically, the 
'time_after()' can return 'true' unexpectedly due to a garbage value (like 
zero...) of the 'stale_limit' unless it's intentionally set in the 1st 
condition. However, the 1st condition with the 'prev_from' is not always 
satisfied. In case the next 'from' is coincidentally identical to the 
'prev_from', we will miss it.

> - if (r->prev_from != from) {
> - r->prev_from = from;
> - r->stale_limit = jiffies + msecs_to_jiffies(r->tolerance);
> - } else if (time_after(jiffies, r->stale_limit)) {
> - pr_warn("Retransmission failure on link <%s>\n", l->name);

For example:
1) n-th retransmit: (prev_from = x, from = 100)
==> ok, we set the variables: prev_from = 100, stale_limit = jiffies + 1.5s
2) now, this pkt #100 was retransmitted successfully...
3) Later on, n+1-th retransmit: (prev_from = 100, from = 100)
-> We don’t set the 'stale_limit' but do the “else if” and bump!

Now, we can try to reset or re-initialize the 'prev_from' somehow, e.g. when 
the pkt #100 is ack-ed & released, but what value will be for it? Note, any 
value is a valid sequence number, and if the next 'from' equals that value, we 
will face with the same trouble again.
Trying to set the 'stale_limit' to very far in the future is irrelevant too 
because it turns out that we will disable the feature if the same 'from' is 
really faced with the repeated retransmit failure!

> 
> Instead, a jiffies variable will be added to individual skbs (like the
> way we restrict the skb retransmissions) in order to mark the first skb
> retransmit time. Later on, at the next retransmissions, the timestamp
> will be checked to see if the skb in the link transmq is "too stale",
> that is, the link tolerance time has passed, so that a link reset will
> be ordered. Note, just checking on the first skb in the queue is fine
> enough since it must be the oldest one.
> A counter is also added to keep track the actual skb retransmissions'
> number for later checking when the failure happens.
> 
> The downside of this approach is that the skb->cb[] buffer is about to
> be exhausted, however it is always able to allocate another memory area
> and keep a reference to it when needed.
> 
> Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria")
> Reported-by: Hoang Le 
> Signed-off-by: Tuong Lien 
> ---
>  net/tipc/link.c | 92 
> -
>  net/tipc/msg.h  |  8 +++--
>  2 files changed, 57 insertions(+), 43 deletions(-)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 66d3a07bc571..c2c5c53cad22 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -106,8 +106,6 @@ struct tipc_stats {
>   * @transmitq: queue for sent, non-acked messages
>   * @backlogq: queue for messages waiting to be sent
>   * @snt_nxt: next sequence number to use for outbound messages
> - * @prev_from: sequence number of most previous retransmission request
> - * @stale_limit: time when repeated identical retransmits must force link 
> reset
>   * @ackers: # of peers that needs to ack each packet before it can be 
> released
>   * @acked: # last packet acked by a certain peer. Used for broadcast.
>   * @rcv_nxt: next sequence number to expect for inbound messages
> @@ -164,9 +162,7 @@ struct tipc_link {
>   u16 limit;
>   } backlog[5];
>   u16 snd_nxt;
> - u16 prev_from;
>   u16 window;
> - unsigned long stale_limit;
>  
>   /* Reception */
>   u16 rcv_nxt;
> @@ -1044,47 +1040,53 @@ static void tipc_link_advance_backlog(struct 
> tipc_link *l,
>   * link_retransmit_failure() - Detect repeated retransmit failures
>   * @l: tipc link sender
>   * @r: tipc link receiver (= l in case of 

Re: [tipc-discussion] [net] tipc: fix false detection of retransmit failures

2019-07-31 Thread Ying Xue
On 7/19/19 11:56 AM, Tuong Lien wrote:
> This commit eliminates the use of the link 'stale_limit' & 'prev_from'
> (besides the already removed - 'stale_cnt') variables in the detection
> of repeated retransmit failures as there is no proper way to initialize
> them to avoid a false detection, i.e. it is not really a retransmission
> failure but due to a garbage values in the variables.

Sorry, I couldn't understand the reason why we have no proper way to
initialize 'stale_limit' & 'prev_from' variables of tipc_link struct.

As far as I understand, the two variables have been set to zero when
tipc_link object is allocated with kzalloc() in tipc_link_create().

Can you please help me understand what real scenario we cannot properly
set them.

> 
> Instead, a jiffies variable will be added to individual skbs (like the
> way we restrict the skb retransmissions) in order to mark the first skb
> retransmit time. Later on, at the next retransmissions, the timestamp
> will be checked to see if the skb in the link transmq is "too stale",
> that is, the link tolerance time has passed, so that a link reset will
> be ordered. Note, just checking on the first skb in the queue is fine
> enough since it must be the oldest one.
> A counter is also added to keep track the actual skb retransmissions'
> number for later checking when the failure happens.
> 
> The downside of this approach is that the skb->cb[] buffer is about to
> be exhausted, however it is always able to allocate another memory area
> and keep a reference to it when needed.
> 
> Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria")
> Reported-by: Hoang Le 
> Signed-off-by: Tuong Lien 
> ---
>  net/tipc/link.c | 92 
> -
>  net/tipc/msg.h  |  8 +++--
>  2 files changed, 57 insertions(+), 43 deletions(-)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 66d3a07bc571..c2c5c53cad22 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -106,8 +106,6 @@ struct tipc_stats {
>   * @transmitq: queue for sent, non-acked messages
>   * @backlogq: queue for messages waiting to be sent
>   * @snt_nxt: next sequence number to use for outbound messages
> - * @prev_from: sequence number of most previous retransmission request
> - * @stale_limit: time when repeated identical retransmits must force link 
> reset
>   * @ackers: # of peers that needs to ack each packet before it can be 
> released
>   * @acked: # last packet acked by a certain peer. Used for broadcast.
>   * @rcv_nxt: next sequence number to expect for inbound messages
> @@ -164,9 +162,7 @@ struct tipc_link {
>   u16 limit;
>   } backlog[5];
>   u16 snd_nxt;
> - u16 prev_from;
>   u16 window;
> - unsigned long stale_limit;
>  
>   /* Reception */
>   u16 rcv_nxt;
> @@ -1044,47 +1040,53 @@ static void tipc_link_advance_backlog(struct 
> tipc_link *l,
>   * link_retransmit_failure() - Detect repeated retransmit failures
>   * @l: tipc link sender
>   * @r: tipc link receiver (= l in case of unicast)
> - * @from: seqno of the 1st packet in retransmit request
>   * @rc: returned code
>   *
>   * Return: true if the repeated retransmit failures happens, otherwise
>   * false
>   */
>  static bool link_retransmit_failure(struct tipc_link *l, struct tipc_link *r,
> - u16 from, int *rc)
> + int *rc)
>  {
>   struct sk_buff *skb = skb_peek(>transmq);
>   struct tipc_msg *hdr;
>  
>   if (!skb)
>   return false;
> - hdr = buf_msg(skb);
>  
> - /* Detect repeated retransmit failures on same packet */
> - if (r->prev_from != from) {
> - r->prev_from = from;
> - r->stale_limit = jiffies + msecs_to_jiffies(r->tolerance);
> - } else if (time_after(jiffies, r->stale_limit)) {
> - pr_warn("Retransmission failure on link <%s>\n", l->name);
> - link_print(l, "State of link ");
> - pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n",
> - msg_user(hdr), msg_type(hdr), msg_size(hdr),
> - msg_errcode(hdr));
> - pr_info("sqno %u, prev: %x, src: %x\n",
> - msg_seqno(hdr), msg_prevnode(hdr), msg_orignode(hdr));
> -
> - trace_tipc_list_dump(>transmq, true, "retrans failure!");
> - trace_tipc_link_dump(l, TIPC_DUMP_NONE, "retrans failure!");
> - trace_tipc_link_dump(r, TIPC_DUMP_NONE, "retrans failure!");
> + if (!TIPC_SKB_CB(skb)->retr_cnt)
> + return false;
>  
> - if (link_is_bc_sndlink(l))
> - *rc = TIPC_LINK_DOWN_EVT;
> + if (!time_after(jiffies, TIPC_SKB_CB(skb)->retr_stamp +
> + msecs_to_jiffies(r->tolerance)))
> + return false;
> +
> + hdr = buf_msg(skb);
> + if (link_is_bc_sndlink(l) && !less(r->acked, msg_seqno(hdr)))
> + return 

Re: [tipc-discussion] [net] tipc: fix false detection of retransmit failures

2019-07-22 Thread Jon Maloy
Acked-by: Jon Maloy 

Remember to change the prefix to 'net-next' and send it when net-next re-opens.

///jon

> -Original Message-
> From: Tuong Lien 
> Sent: 18-Jul-19 23:57
> To: tipc-discussion@lists.sourceforge.net; Jon Maloy
> ; ma...@donjonn.com; ying@windriver.com
> Subject: [net] tipc: fix false detection of retransmit failures
> 
> This commit eliminates the use of the link 'stale_limit' & 'prev_from'
> (besides the already removed - 'stale_cnt') variables in the detection of
> repeated retransmit failures as there is no proper way to initialize them to
> avoid a false detection, i.e. it is not really a retransmission failure but 
> due to a
> garbage values in the variables.
> 
> Instead, a jiffies variable will be added to individual skbs (like the way we
> restrict the skb retransmissions) in order to mark the first skb retransmit 
> time.
> Later on, at the next retransmissions, the timestamp will be checked to see if
> the skb in the link transmq is "too stale", that is, the link tolerance time 
> has
> passed, so that a link reset will be ordered. Note, just checking on the 
> first skb
> in the queue is fine enough since it must be the oldest one.
> A counter is also added to keep track the actual skb retransmissions'
> number for later checking when the failure happens.
> 
> The downside of this approach is that the skb->cb[] buffer is about to be
> exhausted, however it is always able to allocate another memory area and
> keep a reference to it when needed.
> 
> Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria")
> Reported-by: Hoang Le 
> Signed-off-by: Tuong Lien 
> ---
>  net/tipc/link.c | 92 -
> 
>  net/tipc/msg.h  |  8 +++--
>  2 files changed, 57 insertions(+), 43 deletions(-)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c index
> 66d3a07bc571..c2c5c53cad22 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -106,8 +106,6 @@ struct tipc_stats {
>   * @transmitq: queue for sent, non-acked messages
>   * @backlogq: queue for messages waiting to be sent
>   * @snt_nxt: next sequence number to use for outbound messages
> - * @prev_from: sequence number of most previous retransmission request
> - * @stale_limit: time when repeated identical retransmits must force link
> reset
>   * @ackers: # of peers that needs to ack each packet before it can be 
> released
>   * @acked: # last packet acked by a certain peer. Used for broadcast.
>   * @rcv_nxt: next sequence number to expect for inbound messages @@ -
> 164,9 +162,7 @@ struct tipc_link {
>   u16 limit;
>   } backlog[5];
>   u16 snd_nxt;
> - u16 prev_from;
>   u16 window;
> - unsigned long stale_limit;
> 
>   /* Reception */
>   u16 rcv_nxt;
> @@ -1044,47 +1040,53 @@ static void tipc_link_advance_backlog(struct
> tipc_link *l,
>   * link_retransmit_failure() - Detect repeated retransmit failures
>   * @l: tipc link sender
>   * @r: tipc link receiver (= l in case of unicast)
> - * @from: seqno of the 1st packet in retransmit request
>   * @rc: returned code
>   *
>   * Return: true if the repeated retransmit failures happens, otherwise
>   * false
>   */
>  static bool link_retransmit_failure(struct tipc_link *l, struct tipc_link *r,
> - u16 from, int *rc)
> + int *rc)
>  {
>   struct sk_buff *skb = skb_peek(>transmq);
>   struct tipc_msg *hdr;
> 
>   if (!skb)
>   return false;
> - hdr = buf_msg(skb);
> 
> - /* Detect repeated retransmit failures on same packet */
> - if (r->prev_from != from) {
> - r->prev_from = from;
> - r->stale_limit = jiffies + msecs_to_jiffies(r->tolerance);
> - } else if (time_after(jiffies, r->stale_limit)) {
> - pr_warn("Retransmission failure on link <%s>\n", l->name);
> - link_print(l, "State of link ");
> - pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n",
> - msg_user(hdr), msg_type(hdr), msg_size(hdr),
> - msg_errcode(hdr));
> - pr_info("sqno %u, prev: %x, src: %x\n",
> - msg_seqno(hdr), msg_prevnode(hdr),
> msg_orignode(hdr));
> -
> - trace_tipc_list_dump(>transmq, true, "retrans failure!");
> - trace_tipc_link_dump(l, TIPC_DUMP_NONE, "retrans failure!");
> - trace_tipc_link_dump(r, TIPC_DUMP_NONE, "retrans failure!");
> + if (!TIPC_SKB_CB(skb)->retr_cnt)
> + return false;
> 
> - if (link_is_bc_sndlink(l))
> - *rc = TIPC_LINK_DOWN_EVT;
> + if (!time_after(jiffies, TIPC_SKB_CB(skb)->retr_stamp +
> + msecs_to_jiffies(r->tolerance)))
> + return false;
> +
> + hdr = buf_msg(skb);
> + if (link_is_bc_sndlink(l) && !less(r->acked, msg_seqno(hdr)))
> + return false;
> 
> + pr_warn("Retransmission