[ https://issues.apache.org/jira/browse/TRAFODION-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gonzalo E Correa resolved TRAFODION-2651.
-----------------------------------------
    Resolution: Fixed

> The monitor to monitor process communication cannot handle a network reset
> ---------------------------------------------------------------------------
>
>                 Key: TRAFODION-2651
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2651
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.2-incubating
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>             Fix For: 2.3
>
>
> The monitor-to-monitor socket communication does not have reconnect logic to
> handle a network reset or transient network errors.
>
> Analysis:
> • During a ~20 second network reset window, no errors are detected by
>   open sockets.
>     o Open sockets are dead, but there is no indication from the TCP/IP
>       stack that the socket is in an error condition.
> • Once the network is restored, a CONNECTIONLOSS is reported by the
>   Zookeeper Client Library.
>     o However, reconnect logic reestablishes the connection with the quorum.
> • At EPOLL expiration time, the EPOLL logic reports “Not heard from peer=n”
>   and treats the peer as Node Down.
>     o The node down logic deletes the corresponding znode,
>       CZClient::WatchNodeDelete().
>     o All monitor processes continually check for expired znodes for each
>       node in the cluster, including their own znode.
>         An expired znode is handled as a down node.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
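
The analysis above describes a peer being declared Node Down as soon as the EPOLL timeout expires, even though the outage was transient. The following is a minimal sketch of the general idea of trying to reconnect before declaring a peer down. It is not the fix committed for TRAFODION-2651, which the issue text does not describe. `PeerMonitor`, `PeerState`, the `reconnect` hook, and the `max_reconnects` threshold are all hypothetical names chosen for illustration, and the real monitor is not written in Python.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class PeerState(Enum):
    UP = "up"
    SUSPECT = "suspect"   # timed out, but reconnect attempts remain
    DOWN = "down"         # only now would node-down handling run


@dataclass
class PeerMonitor:
    """Tracks one peer. All names and thresholds here are illustrative."""
    reconnect: Callable[[], bool]  # hypothetical hook: True if a new socket was opened
    max_reconnects: int = 3        # assumed limit, not taken from the issue
    state: PeerState = PeerState.UP
    failed_reconnects: int = 0

    def on_timeout(self) -> PeerState:
        """Called when the poll interval expires with nothing heard from the peer."""
        if self.state is PeerState.DOWN:
            return self.state
        if self.reconnect():
            # The old socket was dead but the peer is reachable again.
            self.state = PeerState.UP
            self.failed_reconnects = 0
        else:
            self.failed_reconnects += 1
            if self.failed_reconnects >= self.max_reconnects:
                self.state = PeerState.DOWN
            else:
                self.state = PeerState.SUSPECT
        return self.state


# Transient reset: two reconnect attempts fail, the third succeeds.
attempts = iter([False, False, True])
transient = PeerMonitor(reconnect=lambda: next(attempts))
transient_states = [transient.on_timeout() for _ in range(3)]

# Peer that is really gone: every reconnect attempt fails.
dead = PeerMonitor(reconnect=lambda: False)
dead_states = [dead.on_timeout() for _ in range(3)]
```

In this sketch the transient case ends in `PeerState.UP` and only the peer that never answers reaches `PeerState.DOWN`, so a step such as znode deletion would be deferred until the reconnect attempts are exhausted.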