[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing
[ https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324361#comment-16324361 ]

ASF GitHub Bot commented on TRAFODION-2881:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1392#discussion_r161296519

--- Diff: core/sqf/monitor/linux/cmsh.cxx ---
@@ -128,31 +168,97 @@ int CCmsh::GetClusterState( PhysicalNodeNameMap_t )
             if (it != physicalNodeMap.end())
             {
                 // TEST_POINT and Exclude List : to force state down on node name
-                const char *downNodeName = getenv( TP001_NODE_DOWN );
-                const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
-                string downNodeString = " ";
-                if (downNodeList)
-                {
-                    downNodeString += downNodeList;
-                    downNodeString += " ";
-                }
-                string downNodeToFind = " ";
-                downNodeToFind += nodeName.c_str();
-                downNodeToFind += " ";
-                if (((downNodeList != NULL) &&
-                      strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
-                    ( (downNodeName != NULL) &&
-                      !strcmp( downNodeName, nodeName.c_str()) ))
-                {
-                    nodeState = StateDown;
-                }
-
+                const char *downNodeName = getenv( TP001_NODE_DOWN );
+                const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
+                string downNodeString = " ";
+                if (downNodeList)
+                {
+                    downNodeString += downNodeList;
+                    downNodeString += " ";
+                }
+                string downNodeToFind = " ";
+                downNodeToFind += nodeName.c_str();
+                downNodeToFind += " ";
+                if (((downNodeList != NULL) &&
+                      strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
+                    ((downNodeName != NULL) &&
+                      !strcmp(downNodeName, nodeName.c_str())))
+                {
+                    nodeState = StateDown;
+                }
+
                 // Set physical node state
                 physicalNode = it->second;
                 physicalNode->SetState( nodeState );
             }
         }
-    }
+    }
+
+    TRACE_EXIT;
+    return( rc );
+}
+
+///
+//
+// Function/Method: CCmsh::GetNodeState
+//
+// Description: Updates the state of the nodeName in the physicalNode passed in
+// as a parameter. Caller should ensure that the node names are already
+// present in the physicalNodeMap.
+//
+// Return:
+//        0 - success
+//       -1 - failure
+//
+///
+int CCmsh::GetNodeState( char *name ,CPhysicalNode *physicalNode )
+{
+    const char method_name[] = "CCmsh::GetNodeState";
+    TRACE_ENTRY;
+
+    int rc;
+
+    rc = PopulateNodeState( name );
+
+    if ( rc != -1 )
+    {
+        // Parse each line extracting name and state
+        string nodeName;
+        NodeState_t nodeState;
+        PhysicalNodeNameMap_t::iterator it;
+
+        StringList_t::iterator alit;
+        for ( alit = nodeStateList_.begin(); alit != nodeStateList_.end() ; alit++ )
+        {
+            ParseNodeStatus( *alit, nodeName, nodeState );
+
+            // TEST_POINT and Exclude List : to force state down on node name
+            const char *downNodeName = getenv( TP001_NODE_DOWN );
+            const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
--- End diff --

Created JIRA (TRAFODION-2907) to clean up the use of TRAF_EXCLUDE_LIST in the monitor code.

> Multiple node failures occur during HA testing
> --
>
> Key: TRAFODION-2881
> URL: https://issues.apache.org/jira/browse/TRAFODION-2881
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation
> Affects Versions: 2.3
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Inflicting server failure in certain modes will cause multiple monitor
> processes to also bring their nodes down along with the intended target of the
> test.
> Server down modes:
> init 6
> reboot -f
> shutdown -r now
> shell node down command
> In
[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing
[ https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324353#comment-16324353 ]

ASF GitHub Bot commented on TRAFODION-2881:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1392#discussion_r161294809

--- Diff: core/sqf/monitor/linux/cmsh.cxx ---
@@ -128,31 +168,97 @@ int CCmsh::GetClusterState( PhysicalNodeNameMap_t )
             if (it != physicalNodeMap.end())
             {
                 // TEST_POINT and Exclude List : to force state down on node name
-                const char *downNodeName = getenv( TP001_NODE_DOWN );
-                const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
-                string downNodeString = " ";
-                if (downNodeList)
-                {
-                    downNodeString += downNodeList;
-                    downNodeString += " ";
-                }
-                string downNodeToFind = " ";
-                downNodeToFind += nodeName.c_str();
-                downNodeToFind += " ";
-                if (((downNodeList != NULL) &&
-                      strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
-                    ( (downNodeName != NULL) &&
-                      !strcmp( downNodeName, nodeName.c_str()) ))
-                {
-                    nodeState = StateDown;
-                }
-
+                const char *downNodeName = getenv( TP001_NODE_DOWN );
+                const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
+                string downNodeString = " ";
+                if (downNodeList)
+                {
+                    downNodeString += downNodeList;
+                    downNodeString += " ";
+                }
+                string downNodeToFind = " ";
+                downNodeToFind += nodeName.c_str();
+                downNodeToFind += " ";
+                if (((downNodeList != NULL) &&
+                      strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
+                    ((downNodeName != NULL) &&
+                      !strcmp(downNodeName, nodeName.c_str())))
+                {
+                    nodeState = StateDown;
+                }
+
                 // Set physical node state
                 physicalNode = it->second;
                 physicalNode->SetState( nodeState );
             }
         }
-    }
+    }
+
+    TRACE_EXIT;
+    return( rc );
+}
+
+///
+//
+// Function/Method: CCmsh::GetNodeState
+//
+// Description: Updates the state of the nodeName in the physicalNode passed in
+// as a parameter. Caller should ensure that the node names are already
+// present in the physicalNodeMap.
+//
+// Return:
+//        0 - success
+//       -1 - failure
+//
+///
+int CCmsh::GetNodeState( char *name ,CPhysicalNode *physicalNode )
+{
+    const char method_name[] = "CCmsh::GetNodeState";
+    TRACE_ENTRY;
+
+    int rc;
+
+    rc = PopulateNodeState( name );
+
+    if ( rc != -1 )
+    {
+        // Parse each line extracting name and state
+        string nodeName;
+        NodeState_t nodeState;
+        PhysicalNodeNameMap_t::iterator it;
+
+        StringList_t::iterator alit;
+        for ( alit = nodeStateList_.begin(); alit != nodeStateList_.end() ; alit++ )
+        {
+            ParseNodeStatus( *alit, nodeName, nodeState );
+
+            // TEST_POINT and Exclude List : to force state down on node name
+            const char *downNodeName = getenv( TP001_NODE_DOWN );
+            const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
--- End diff --

Yes, it is! Good catch!

> Multiple node failures occur during HA testing
> --
>
> Key: TRAFODION-2881
> URL: https://issues.apache.org/jira/browse/TRAFODION-2881
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation
> Affects Versions: 2.3
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Inflicting server failure in certain modes will cause multiple monitor
> processes to also bring their nodes down along with the intended target of the
> test.
> Server down modes:
> init 6
> reboot -f
> shutdown -r now
> shell node down command
> In addition, after a server down, the shell 'node up' command will also
[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing
[ https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324252#comment-16324252 ]

ASF GitHub Bot commented on TRAFODION-2881:
---

Github user trinakrug commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1392#discussion_r161276162

--- Diff: core/sqf/monitor/linux/cmsh.cxx ---
@@ -128,31 +168,97 @@ int CCmsh::GetClusterState( PhysicalNodeNameMap_t )
             if (it != physicalNodeMap.end())
             {
                 // TEST_POINT and Exclude List : to force state down on node name
-                const char *downNodeName = getenv( TP001_NODE_DOWN );
-                const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
-                string downNodeString = " ";
-                if (downNodeList)
-                {
-                    downNodeString += downNodeList;
-                    downNodeString += " ";
-                }
-                string downNodeToFind = " ";
-                downNodeToFind += nodeName.c_str();
-                downNodeToFind += " ";
-                if (((downNodeList != NULL) &&
-                      strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
-                    ( (downNodeName != NULL) &&
-                      !strcmp( downNodeName, nodeName.c_str()) ))
-                {
-                    nodeState = StateDown;
-                }
-
+                const char *downNodeName = getenv( TP001_NODE_DOWN );
+                const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
+                string downNodeString = " ";
+                if (downNodeList)
+                {
+                    downNodeString += downNodeList;
+                    downNodeString += " ";
+                }
+                string downNodeToFind = " ";
+                downNodeToFind += nodeName.c_str();
+                downNodeToFind += " ";
+                if (((downNodeList != NULL) &&
+                      strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
+                    ((downNodeName != NULL) &&
+                      !strcmp(downNodeName, nodeName.c_str())))
+                {
+                    nodeState = StateDown;
+                }
+
                 // Set physical node state
                 physicalNode = it->second;
                 physicalNode->SetState( nodeState );
             }
         }
-    }
+    }
+
+    TRACE_EXIT;
+    return( rc );
+}
+
+///
+//
+// Function/Method: CCmsh::GetNodeState
+//
+// Description: Updates the state of the nodeName in the physicalNode passed in
+// as a parameter. Caller should ensure that the node names are already
+// present in the physicalNodeMap.
+//
+// Return:
+//        0 - success
+//       -1 - failure
+//
+///
+int CCmsh::GetNodeState( char *name ,CPhysicalNode *physicalNode )
+{
+    const char method_name[] = "CCmsh::GetNodeState";
+    TRACE_ENTRY;
+
+    int rc;
+
+    rc = PopulateNodeState( name );
+
+    if ( rc != -1 )
+    {
+        // Parse each line extracting name and state
+        string nodeName;
+        NodeState_t nodeState;
+        PhysicalNodeNameMap_t::iterator it;
+
+        StringList_t::iterator alit;
+        for ( alit = nodeStateList_.begin(); alit != nodeStateList_.end() ; alit++ )
+        {
+            ParseNodeStatus( *alit, nodeName, nodeState );
+
+            // TEST_POINT and Exclude List : to force state down on node name
+            const char *downNodeName = getenv( TP001_NODE_DOWN );
+            const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
--- End diff --

Is TRAF_EXCLUDE_LIST obsolete?

> Multiple node failures occur during HA testing
> --
>
> Key: TRAFODION-2881
> URL: https://issues.apache.org/jira/browse/TRAFODION-2881
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation
> Affects Versions: 2.3
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Inflicting server failure in certain modes will cause multiple monitor
> processes to also bring their nodes down along with the intended target of the
> test.
> Server down modes:
> init 6
> reboot -f
> shutdown -r now
> shell node down command
> In addition, after a server down, the shell 'node up' command
[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing
[ https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16322606#comment-16322606 ]

ASF GitHub Bot commented on TRAFODION-2881:
---

GitHub user zcorrea opened a pull request:

https://github.com/apache/trafodion/pull/1392

[TRAFODION-2881] HA fixes

Fixed multiple problems in monitor Allgather() socket reconnect logic.
- Separated node down detection logic from communication errors and timeouts to better handle multiple failure scenarios
- Better handling of network resets
- Additional trace information
- Fixed 'node up' hang in monitor shell due to TmSync race condition

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zcorrea/trafodion TRAFODION-2881

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/trafodion/pull/1392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1392

commit e832d827507521998567d4cc5d92e4239007d19a
Author: Zalo Correa
Date: 2018-01-11T17:32:11Z

[TRAFODION-2881] HA fixes

Fixed multiple problems in monitor Allgather() socket reconnect logic.
- Separated node down detection logic from communication errors and timeouts to better handle multiple failure scenarios
- Better handling of network resets
- Additional trace information
- Fixed 'node up' hang in monitor shell due to TmSync race condition

> Multiple node failures occur during HA testing
> --
>
> Key: TRAFODION-2881
> URL: https://issues.apache.org/jira/browse/TRAFODION-2881
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation
> Affects Versions: 2.3
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Inflicting server failure in certain modes will cause multiple monitor
> processes to also bring their nodes down along with the intended target of the
> test.
> Server down modes:
> init 6
> reboot -f
> shutdown -r now
> shell node down command
> In addition, after a server down, the shell 'node up' command will also fail
> intermittently. This requires a longevity HA test to down and up nodes over a
> long period of time, such as 24-48 hours.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
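
For readers following the review comments on TP001_NODE_DOWN and TRAF_EXCLUDE_LIST: the check in cmsh.cxx that the comments attach to forces a node's state to down when its name matches a single test-point value or appears in an exclude list. It pads both the list and the name with spaces so that a name only matches as a whole token. The sketch below restates that logic as a standalone function. The helper name IsForcedDown is hypothetical and not part of the Trafodion sources, the two pointer parameters stand in for the getenv() results, and the list is assumed to be space-separated, as the padding in the original code suggests.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper (not in the Trafodion sources): returns true when
// nodeName should be forced down. downNodeName and downNodeList stand in for
// the values read from the environment and may be null when unset.
bool IsForcedDown(const std::string &nodeName,
                  const char *downNodeName,
                  const char *downNodeList)
{
    // Single-node test point: exact match on the node name.
    if (downNodeName != nullptr &&
        std::strcmp(downNodeName, nodeName.c_str()) == 0)
    {
        return true;
    }

    if (downNodeList == nullptr)
    {
        return false;
    }

    // Pad the list and the name with spaces so that "n1" does not match
    // inside "n10" (assumes a space-separated list).
    const std::string padded = " " + std::string(downNodeList) + " ";
    const std::string needle = " " + nodeName + " ";
    return padded.find(needle) != std::string::npos;
}
```

Taking the two values as parameters instead of calling getenv() inside the function keeps the matching rule testable on its own. The cleanup tracked under TRAFODION-2907 may of course remove the exclude-list branch entirely.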