Hi,

We have hit a TIPC issue in our product. We are using an old kernel (3.10.38), 
so I hope someone already knows this case and can help us with it.

Our product is a cluster system with two controller nodes and several payload 
nodes. We deploy OpenSAF in the system to manage these nodes over TIPC.

Several days ago we rebooted a payload node in the system. After the reboot, 
the payload had a brief network chip/driver problem, and both TIPC and other 
protocols (such as TCP) were impacted.
The network recovered immediately, and the programs based on other protocols 
such as TCP also recovered quickly, but TIPC did not come back until the next 
reboot.

Below is the syslog for this case:

1. After the node 'pld0102' rebooted, it successfully set up TIPC connections 
with the other nodes.
2018-04-09T09:53:12.705735+08:00 kern.info pld0102 kernel: tipc: Established 
link <1.1.2:bond0-1.1.15:eth2> on network plane A
2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: Established 
link <1.1.2:bond0-1.1.5:bond0> on network plane A
2018-04-09T09:53:12.705853+08:00 kern.info pld0102 kernel: tipc: Established 
link <1.1.2:bond0-1.1.10:bond0> on network plane A
2018-04-09T09:53:12.706010+08:00 kern.info pld0102 kernel: tipc: Established 
link <1.1.2:bond0-1.1.16:eth2> on network plane A
2018-04-09T09:53:12.706022+08:00 kern.info pld0102 kernel: tipc: Established 
link <1.1.2:bond0-1.1.12:bond0> on network plane A

2. Several minutes after the reboot, the network chip/driver had a problem and 
recovered immediately. The programs based on TCP/IP were also impacted and 
then recovered.
2018-04-09T09:54:28.061865+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added 
'100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s 
password:]
2018-04-09T09:54:28.277046+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added 
'100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s 
password:]   ===>>> THE PAYLOAD NODE 'pld0102' CAN ACCESS THE CONTROLLER NODE
2018-04-09T09:54:28.377690+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
sleep for 20 seconds(failure 0, loop count 1)
2018-04-09T09:54:53.406043+08:00 user.warning pld0102 AutoRecoverReloadFail.py: 
isScmAccessible 100.100.1.15 : fail, return [Error: ]   ===>>> THE PAYLOAD NODE 
'pld0102' COULD NOT ACCESS THE CONTROLLER NODE SUDDENLY.
2018-04-09T09:54:53.908054+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
sleep for 18 seconds(failure 1, loop count 2)
2018-04-09T09:55:12.040157+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added 
'100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s 
password:]   ===>>> TCP/IP PROTOCOL RECOVERED AND THEN 'pld0102' CAN 
CONTINUE TO ACCESS THE CONTROLLER NODE
2018-04-09T09:55:12.262501+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added 
'100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s 
password:]
2018-04-09T09:55:12.363050+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
sleep for 15 seconds(failure 0, loop count 3)
2018-04-09T09:55:27.510388+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
isScmAccessible 100.100.1.15 : succeed, return [Warning: Permanently added 
'100.100.1.15' (ECDSA) to the list of known hosts.#015#015#[email protected]'s 
password:]
2018-04-09T09:55:27.719778+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
isScmAccessible 100.100.1.16 : succeed, return [Warning: Permanently added 
'100.100.1.16' (ECDSA) to the list of known hosts.#015#015#[email protected]'s 
password:]
2018-04-09T09:55:27.820637+08:00 user.info pld0102 AutoRecoverReloadFail.py: 
sleep for 18 seconds(failure 0, loop count 4)

3. However, about 30 seconds later TIPC on 'pld0102' also ran into problems: it 
lost contact with all other nodes and did not recover until the next reboot 
(which happened about 10 minutes later).
2018-04-09T09:55:42.428828+08:00 kern.warning pld0102 kernel: tipc: Resetting 
link <1.1.2:bond0-1.1.5:bond0>, peer not responding
2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: Lost link 
<1.1.2:bond0-1.1.5:bond0> on network plane A
2018-04-09T09:55:42.428892+08:00 kern.info pld0102 kernel: tipc: Lost contact 
with <1.1.5>
2018-04-09T09:55:42.428904+08:00 kern.warning pld0102 kernel: tipc: Resetting 
link <1.1.2:bond0-1.1.10:bond0>, peer not responding
2018-04-09T09:55:42.428915+08:00 kern.info pld0102 kernel: tipc: Lost link 
<1.1.2:bond0-1.1.10:bond0> on network plane A
2018-04-09T09:55:42.428967+08:00 kern.info pld0102 kernel: tipc: Lost contact 
with <1.1.10>
2018-04-09T09:55:42.428978+08:00 kern.warning pld0102 kernel: tipc: Resetting 
link <1.1.2:bond0-1.1.15:eth2>, peer not responding
2018-04-09T09:55:42.428984+08:00 kern.info pld0102 kernel: tipc: Lost link 
<1.1.2:bond0-1.1.15:eth2> on network plane A
2018-04-09T09:55:42.428991+08:00 kern.info pld0102 kernel: tipc: Lost contact 
with <1.1.15>
2018-04-09T09:55:42.927546+08:00 kern.warning pld0102 kernel: tipc: Resetting 
link <1.1.2:bond0-1.1.16:eth2>, peer not responding
2018-04-09T09:55:42.927607+08:00 kern.info pld0102 kernel: tipc: Lost link 
<1.1.2:bond0-1.1.16:eth2> on network plane A
2018-04-09T09:55:42.927621+08:00 kern.info pld0102 kernel: tipc: Lost contact 
with <1.1.16>
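
In case it helps anyone reading the excerpts above, here is a minimal sketch 
of how the TIPC link messages could be turned into a per-link timeline. It is 
not part of our product. The function names (parse_link_events, link_uptimes) 
are hypothetical, and it assumes the wrapped log lines have first been joined 
back into one line per message and that Python 3.7 or later is available.

```python
import re
from datetime import datetime

# Matches kernel TIPC link messages of the form quoted in this mail, e.g.
#   <timestamp> kern.info <host> kernel: tipc: Established link <...> on ...
# Assumption: each message is on a single line (mail wrapping undone).
LINK_EVENT = re.compile(
    r"^(?P<ts>\S+) \S+ (?P<host>\S+) kernel: tipc: "
    r"(?P<event>Established|Lost) link <(?P<link>[^>]+)>"
)


def parse_link_events(lines):
    """Return (timestamp, event, link) tuples for link up/down messages."""
    events = []
    for line in lines:
        m = LINK_EVENT.match(line)
        if m:
            ts = datetime.fromisoformat(m.group("ts"))
            events.append((ts, m.group("event"), m.group("link")))
    return events


def link_uptimes(events):
    """Map each link to the seconds between 'Established' and 'Lost'."""
    up_since = {}
    uptimes = {}
    for ts, event, link in events:
        if event == "Established":
            up_since[link] = ts
        elif link in up_since:
            uptimes[link] = (ts - up_since.pop(link)).total_seconds()
    return uptimes


if __name__ == "__main__":
    # Two messages taken from the excerpts above, joined onto single lines.
    sample = [
        "2018-04-09T09:53:12.705777+08:00 kern.info pld0102 kernel: tipc: "
        "Established link <1.1.2:bond0-1.1.5:bond0> on network plane A",
        "2018-04-09T09:55:42.428879+08:00 kern.info pld0102 kernel: tipc: "
        "Lost link <1.1.2:bond0-1.1.5:bond0> on network plane A",
    ]
    for link, seconds in link_uptimes(parse_link_events(sample)).items():
        print(link, seconds)
```

Run against the full syslog, this should make it easy to see whether any link 
was ever re-established after the "Lost link" messages.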


Thanks for any comments, and please let me know if any other information is 
needed.


Regards,
Jianfeng
