Hi Jonas,

OK, I just pushed it; please test once on 4.7:

============================================================

branch:      opensaf-4.7.x
parent:      8043:4a8a00097561
user:        A V Mahesh <mahesh.va...@oracle.com>
date:        Thu Sep 15 10:50:31 2016 +0530
summary:     dtm: TCP Improve node failFast with TCP_USER_TIMEOUT [#2014]

============================================================
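
For readers not familiar with the socket option named in the commit summary, here is a minimal sketch of how TCP_USER_TIMEOUT can be set on a Linux TCP socket. It is not the code from the dtm patch: the helper names and the timeout value used with them are hypothetical, chosen only for illustration. The option (Linux 2.6.37 and later) takes a value in milliseconds and limits how long transmitted data may stay unacknowledged before the kernel closes the connection.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not from the dtm patch): set TCP_USER_TIMEOUT,
 * in milliseconds, on an existing TCP socket. Returns 0 on success. */
static int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}

/* Hypothetical helper: read the current TCP_USER_TIMEOUT back. */
static int get_tcp_user_timeout(int fd, unsigned int *timeout_ms)
{
    socklen_t len = sizeof(*timeout_ms);

    assert(timeout_ms != NULL);
    return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms, &len);
}

/* Hypothetical helper: create an IPv4 TCP socket with the timeout applied.
 * Returns the descriptor, or -1 on failure. */
static int open_failfast_socket(unsigned int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (set_tcp_user_timeout(fd, timeout_ms) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With a value such as 4000 ms (an assumed figure, not one taken from the patch), a connection to a peer that went down without closing its sockets should be reported to the application as ETIMEDOUT after roughly that interval, rather than after the full retransmission backoff.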

-AVM

On 9/15/2016 12:08 AM, Jonas Arndt wrote:
>
> Mahesh,
>
> Can we get this back-ported to 4.7.x as well?
>
> Cheers,
>
> // Jonas
>
> ------------------------------------------------------------------------
>
> *[tickets:#2014] <https://sourceforge.net/p/opensaf/tickets/2014/> 
> Rebooted controller not detected in TCP*
>
> *Status:* review
> *Milestone:* 5.0.1
> *Created:* Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
> *Last Updated:* Wed Sep 14, 2016 04:51 AM UTC
> *Owner:* A V Mahesh (AVM)
> *Attachments:*
>
>   * logs.tgz
>     <https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz>
>     (84.1 kB; application/x-compressed-tar)
>   * tcp_user_timeout_2014.patch
>     <https://sourceforge.net/p/opensaf/tickets/2014/attachment/tcp_user_timeout_2014.patch>
>     (5.5 kB; application/octet-stream)
>
> OS environment:
>
> Debian Jessie (OpenSAF is running on bare metal, no containers or VMs)
> 4.4.7 kernel
> Network eth0, bonded, OVS (I have tried all of them and the problem is there 
> in all configurations)
>
> In 20% of the cases, a "reboot -f" on controller2 is not detected or
> acted on. The mds.log shows:
>
> Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
> Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
> Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
> Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
> Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
> Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
> Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
> Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
> Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
> Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
> Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
> Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
> Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x00000000,1>
> Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
>
> Still, there is nothing in the syslog indicating that controller2 has
> left the cluster. This is for TCP.
> When the node comes back online (without OpenSAF being started),
> controller1 finally notices and fails over the apps.
>
> When the reboot is not detected, the TCP keepalives stop and TCP goes
> into retransmits instead. I have attached 2 tshark sessions captured
> from controller1, capturing traffic between controller1 and
> controller2. The failed reboot detection is captured in
> "ctrl2_failed_detection.trc", and the working detection is in
> "ctrl2_working.trc". I have also attached all logs in
> /var/log/opensaf and the syslog (all from controller1).
>
> It appears to me that we are hitting something similar to
> "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect".
>
> // Jonas
>
> ------------------------------------------------------------------------
>
> Sent from sourceforge.net because 
> opensaf-tickets@lists.sourceforge.net is subscribed to 
> https://sourceforge.net/p/opensaf/tickets/
>
> To unsubscribe from further messages, a project admin can change 
> settings at https://sourceforge.net/p/opensaf/admin/tickets/options. 
> Or, if this is a mailing list, you can unsubscribe from the mailing list.
>
>
>
> ------------------------------------------------------------------------------
>
>
> _______________________________________________
> Opensaf-tickets mailing list
> Opensaf-tickets@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/opensaf-tickets




---

**[tickets:#2014] Rebooted controller not detected in TCP**

**Status:** fixed
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Thu Sep 15, 2016 05:59 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**

- [logs.tgz](https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz) (84.1 kB; application/x-compressed-tar)
- [tcp_user_timeout_2014.patch](https://sourceforge.net/p/opensaf/tickets/2014/attachment/tcp_user_timeout_2014.patch) (5.5 kB; application/octet-stream)


