Thank you, Jon. I think I see a familiar name here :) Anders has already helped us a lot with OpenSAF, and in recent weeks we have discussed cases similar to this one on the OpenSAF side.

Regards,
Jianfeng

-----Original Message-----
From: Jon Maloy <[email protected]>
Sent: Thursday, April 19, 2018 3:34 AM
To: Jianfeng Dong <[email protected]>; [email protected]
Cc: Anders Widell <[email protected]>
Subject: RE: [tipc-discussion] TIPC did not recover after a short time network problem

Hi Jianfeng,

This is a really hard one. The kernel is very old, and the problem does not sound familiar to me, as a pure TIPC maintainer. However, we do have people working with OpenSAF even in our company, so I will cc your message to one of our guys, just in case it is something he recognizes.

BR
///jon

> -----Original Message-----
> From: Jianfeng Dong [mailto:[email protected]]
> Sent: Wednesday, April 18, 2018 03:23
> To: [email protected]
> Subject: [tipc-discussion] TIPC did not recover after a short time network problem
>
> Hi,
>
> We have a TIPC issue in our product. We are using an old kernel (3.10.38), so I think someone may already know this case and can help us with it.
>
> Our product is a cluster system with two controller nodes and several payload nodes. We deploy OpenSAF in our system to manage these nodes via TIPC.
>
> Several days ago we rebooted a payload node in the system. After the reboot, the payload node had a short-lived network chip/driver problem, and both TIPC and other protocols (such as TCP) were impacted.
> The network recovered immediately, and the programs based on other protocols such as TCP also recovered quickly, but TIPC did not come back until the next reboot.
>
> Below is the syslog for this case:
>
> 1. After the node 'pld0102' rebooted, it succeeded in setting up TIPC connections with the other nodes.
> 2018-04-09T09:53:12.705735+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:53:12.705853+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:53:12.706010+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:53:12.706022+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.12:bond0> on network plane A
>
> 2. Several minutes after the reboot, the network chip/driver had a problem and recovered immediately; the programs based on TCP/IP were also impacted and then recovered.
> 2018-04-09T09:54:28.061865+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:54:28.277046+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> ===>>> THE PAYLOAD NODE 'pld0102' CAN ACCESS THE CONTROLLER NODE
> 2018-04-09T09:54:28.377690+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 20 seconds(failure 0, loop count 1)
> 2018-04-09T09:54:53.406043+08:00 user.warning pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : fail, return [Error: ]
> ===>>> THE PAYLOAD NODE 'pld0102' COULD NOT ACCESS THE CONTROLLER NODE SUDDENLY.
> 2018-04-09T09:54:53.908054+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 1, loop count 2)
> 2018-04-09T09:55:12.040157+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> ===>>> TCP/IP PROTOCOL RECOVERED AND THEN 'pld0102' CAN CONTINUE TO ACCESS THE CONTROLLER NODE
> 2018-04-09T09:55:12.262501+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:55:12.363050+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 15 seconds(failure 0, loop count 3)
> 2018-04-09T09:55:27.510388+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added '100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:55:27.719778+08:00 user.info pld0102 AutoRecoverReloadFail.py: isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added '100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s password:]
> 2018-04-09T09:55:27.820637+08:00 user.info pld0102 AutoRecoverReloadFail.py: sleep for 18 seconds(failure 0, loop count 4)
>
> 3. However, around 30 seconds later TIPC on 'pld0102' also ran into problems: it lost contact with all other nodes and did not recover until the next reboot (which happened 10 minutes later).
> 2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding
> 2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A
> 2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.5>
> 2018-04-09T09:55:42.428904+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.10:bond0>, peer not responding
> 2018-04-09T09:55:42.428915+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.10:bond0> on network plane A
> 2018-04-09T09:55:42.428967+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.10>
> 2018-04-09T09:55:42.428978+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.15:eth2>, peer not responding
> 2018-04-09T09:55:42.428984+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.15:eth2> on network plane A
> 2018-04-09T09:55:42.428991+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.15>
> 2018-04-09T09:55:42.927546+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.16:eth2>, peer not responding
> 2018-04-09T09:55:42.927607+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.16:eth2> on network plane A
> 2018-04-09T09:55:42.927621+08:00 kern.info pld0102 kernel: tipc: Lost contact with <1.1.16>
>
> Thanks for any comment, and please let me know if other information is needed.
>
> Regards,
> Jianfeng
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
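
For anyone who wants to check the timeline in the quoted syslog mechanically, here is a minimal Python sketch that measures how long each TIPC link stayed up between its "Established link" and "Lost link" messages. It assumes raw syslog lines in the same layout as the excerpt in the report (without the "> " mail quoting); the names SAMPLE, EVENT_RE and link_lifetimes are made up for this illustration and are not part of TIPC or OpenSAF.

```python
import re
from datetime import datetime

# Three lines copied from the syslog quoted in the report (link to node 1.1.5).
SAMPLE = """\
2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: Established link <1.1.2:bond0-1.1.5:bond0> on network plane A
2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: Resetting link <1.1.2:bond0-1.1.5:bond0>, peer not responding
2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A
"""

# Assumed layout: "<ISO timestamp> <facility.level> <host> kernel: tipc: <message>".
# Only "Established link" and "Lost link" messages are matched.
EVENT_RE = re.compile(
    r"^(?P<ts>\S+) \S+ \S+ kernel: tipc: "
    r"(?P<event>Established|Lost) link <(?P<link>[^>]+)>"
)


def link_lifetimes(lines):
    """Map each link name to the seconds between 'Established' and the next 'Lost'."""
    up_since = {}   # link name -> time it was last established
    lifetimes = {}  # link name -> seconds the link stayed up
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue  # not a link up/down message (e.g. "Resetting link")
        ts = datetime.fromisoformat(m["ts"])  # needs Python 3.7 or newer
        link = m["link"]
        if m["event"] == "Established":
            up_since[link] = ts
        elif link in up_since:
            lifetimes[link] = (ts - up_since.pop(link)).total_seconds()
    return lifetimes


if __name__ == "__main__":
    for link, secs in link_lifetimes(SAMPLE.splitlines()).items():
        print(f"{link}: up for {secs:.1f} s")
```

On the three sample lines this reports that the link to 1.1.5 was up for roughly 150 seconds, which matches the 09:53:12 to 09:55:42 window in the report. Feeding it the full syslog of the node would show the same figure for every peer if all links dropped together.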
