Thank you, Jon. I think I see a familiar name here :) Anders has already
helped us a lot with OpenSAF, and we have discussed cases similar to this one
on the OpenSAF side over the last few weeks.



Regards,

Jianfeng



-----Original Message-----
From: Jon Maloy <[email protected]>
Sent: Thursday, April 19, 2018 3:34 AM
To: Jianfeng Dong <[email protected]>; [email protected]
Cc: Anders Widell <[email protected]>
Subject: RE: [tipc-discussion] TIPC did not recover after a short time network 
problem



Hi Jianfeng,

This is a really hard one. The kernel is very old, and the problem does not 
sound familiar to me, as a pure TIPC maintainer.

However, we do have people working with OpenSAF even in our company, so I will 
cc your message to one of our guys, just in case it is something he recognizes.



BR

///jon





> -----Original Message-----
> From: Jianfeng Dong [mailto:[email protected]]
> Sent: Wednesday, April 18, 2018 03:23
> To: [email protected]
> Subject: [tipc-discussion] TIPC did not recover after a short time
> network problem
>
> Hi,
>
> We have a TIPC issue in our product. We are using an old kernel
> (3.10.38), so I think someone may already know this case and can help
> us with it.
>
> Our product is a cluster system with two controller nodes and several
> payload nodes. We deploy OpenSAF in our system to manage these nodes
> via TIPC.
>
> Several days ago we rebooted a payload node in the system. After the
> reboot, the payload node had a short-lived network chip/driver problem,
> and both TIPC and other protocols (such as TCP) were affected.
> The network recovered immediately, and the programs based on other
> protocols such as TCP also recovered quickly, but TIPC did not come
> back until the next reboot.
>
> Below is the syslog for this case:
>
> 1. After the node 'pld0102' rebooted, it succeeded in setting up TIPC
> connections with the other nodes.
> 2018-04-09T09:53:12.705735+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:53:12.705853+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:53:12.706010+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:53:12.706022+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.12:bond0> on network plane A
>
> 2. Several minutes after the reboot, the network chip/driver had a
> problem and recovered immediately; the programs based on the TCP/IP
> protocol were also affected and then recovered.
> 2018-04-09T09:54:28.061865+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:54:28.277046+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> ===>>> THE PAYLOAD NODE 'pld0102' CAN ACCESS THE CONTROLLER NODE
> 2018-04-09T09:54:28.377690+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 20 seconds(failure 0, loop count 1)
> 2018-04-09T09:54:53.406043+08:00 user.warning pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : fail, return [Error: ]
> ===>>> THE PAYLOAD NODE 'pld0102' SUDDENLY COULD NOT ACCESS THE CONTROLLER NODE
> 2018-04-09T09:54:53.908054+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 1, loop count 2)
> 2018-04-09T09:55:12.040157+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> ===>>> TCP/IP PROTOCOL RECOVERED AND THEN 'pld0102' CAN CONTINUE TO ACCESS THE CONTROLLER NODE
> 2018-04-09T09:55:12.262501+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:55:12.363050+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 15 seconds(failure 0, loop count 3)
> 2018-04-09T09:55:27.510388+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:55:27.719778+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:55:27.820637+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 0, loop count 4)
>
> 3. However, about 30 seconds later TIPC on 'pld0102' also ran into
> problems: it lost contact with all other nodes and did not recover
> until the next reboot (which happened 10 minutes later).
> 2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding
> 2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.5>
> 2018-04-09T09:55:42.428904+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.10:bond0>, peer not responding
> 2018-04-09T09:55:42.428915+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:55:42.428967+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.10>
> 2018-04-09T09:55:42.428978+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.15:eth2>, peer not responding
> 2018-04-09T09:55:42.428984+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:55:42.428991+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.15>
> 2018-04-09T09:55:42.927546+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.16:eth2>, peer not responding
> 2018-04-09T09:55:42.927607+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:55:42.927621+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.16>
>
> Thanks for any comments, and please let me know if any other
> information is needed.
>
> Regards,
> Jianfeng
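
For anyone trying to line up the TIPC events in the quoted syslog, here is a minimal Python sketch that pulls the link messages out of such a log and reports how long each link stayed up. It is only an illustration, not part of the original report: the function names `link_events` and `link_uptimes` are made up here, and it assumes each syslog entry has been rejoined onto a single line and uses the timestamp/facility/host layout seen in the excerpt.

```python
import re
from datetime import datetime

# Matches the TIPC link messages seen in the quoted excerpt.
# Assumption: one syslog entry per line, laid out as
# "<ISO timestamp> <facility.level> <host> kernel: tipc: <message>".
LINK_RE = re.compile(
    r"^(?P<ts>\S+)\s+\S+\s+\S+\s+kernel:\s+tipc:\s+"
    r"(?P<event>Established link|Resetting link|Lost link)\s+"
    r"<(?P<link>[^>]+)>"
)


def link_events(lines):
    """Yield (timestamp, event, link) for each recognised TIPC link message."""
    for line in lines:
        m = LINK_RE.match(line.strip())
        if m:
            yield datetime.fromisoformat(m["ts"]), m["event"], m["link"]


def link_uptimes(lines):
    """Return {link: seconds from 'Established link' to the next 'Lost link'}."""
    up_since, uptimes = {}, {}
    for ts, event, link in link_events(lines):
        if event == "Established link":
            up_since[link] = ts
        elif event == "Lost link" and link in up_since:
            uptimes[link] = (ts - up_since.pop(link)).total_seconds()
    return uptimes


# Three entries copied from the quoted log (link towards 1.1.5).
SAMPLE = [
    "2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: "
    "Established link <1.1.2:bond0-1.1.5:bond0> on network plane A",
    "2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: "
    "Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding",
    "2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: "
    "Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A",
]

if __name__ == "__main__":
    for link, seconds in link_uptimes(SAMPLE).items():
        print(f"{link}: up for {seconds:.3f} s")
```

Running the same functions over the complete syslog from every node would show whether the peers reset their side of the links at the same moment, which may help narrow down where the link protocol got stuck.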

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
