[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing

2018-01-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324361#comment-16324361
 ] 

ASF GitHub Bot commented on TRAFODION-2881:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1392#discussion_r161296519
  
--- Diff: core/sqf/monitor/linux/cmsh.cxx ---
@@ -128,31 +168,97 @@ int CCmsh::GetClusterState( PhysicalNodeNameMap_t &physicalNodeMap )
 if (it != physicalNodeMap.end())
 {
 // TEST_POINT and Exclude List : to force state down on node name
-   const char *downNodeName = getenv( TP001_NODE_DOWN );
-   const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
-  string downNodeString = " ";
-  if (downNodeList)
-  {
-downNodeString += downNodeList;
-downNodeString += " ";
-  }
-  string downNodeToFind = " ";
-  downNodeToFind += nodeName.c_str();
-  downNodeToFind += " ";
-   if (((downNodeList != NULL) && 
- strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
-   ( (downNodeName != NULL) && 
- !strcmp( downNodeName, nodeName.c_str()) ))
-  {
-   nodeState = StateDown;
-  }
-  
+const char *downNodeName = getenv( TP001_NODE_DOWN );
+const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
+string downNodeString = " ";
+if (downNodeList)
+{
+downNodeString += downNodeList;
+downNodeString += " ";
+}
+string downNodeToFind = " ";
+downNodeToFind += nodeName.c_str();
+downNodeToFind += " ";
+if (((downNodeList != NULL) && 
+  strstr(downNodeString.c_str(),downNodeToFind.c_str())) ||
+((downNodeName != NULL) && 
+ !strcmp(downNodeName, nodeName.c_str())))
+{
+nodeState = StateDown;
+}
+  
 // Set physical node state
 physicalNode = it->second;
 physicalNode->SetState( nodeState );
 }
 }
-}  
+}  
+
+TRACE_EXIT;
+return( rc );
+}
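+
+///////////////////////////////////////////////////////////////////////////////
+//
+// Function/Method: CCmsh::GetNodeState
+//
+// Description: Updates the state of the nodeName in the physicalNode passed in
+//              as a parameter. Caller should ensure that the node names are
+//              already present in the physicalNodeMap.
+//
+// Return:
+//        0 - success
+//       -1 - failure
+//
+///////////////////////////////////////////////////////////////////////////////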
+int CCmsh::GetNodeState( char *name ,CPhysicalNode  *physicalNode )
+{
+const char method_name[] = "CCmsh::GetNodeState";
+TRACE_ENTRY;
+
+int rc;
+
+rc = PopulateNodeState( name );
+
+if ( rc != -1 )
+{
+// Parse each line extracting name and state
+string nodeName;
+NodeState_t nodeState;
+PhysicalNodeNameMap_t::iterator it;
+
+StringList_t::iterator alit;
+for ( alit = nodeStateList_.begin(); alit != nodeStateList_.end() ; alit++ )
+{
+ParseNodeStatus( *alit, nodeName, nodeState );
+
+// TEST_POINT and Exclude List : to force state down on node name
+const char *downNodeName = getenv( TP001_NODE_DOWN );
+const char *downNodeList = getenv( TRAF_EXCLUDE_LIST );
--- End diff --

Created JIRA (TRAFODION-2907) to clean up the use of TRAF_EXCLUDE_LIST in 
the monitor code.
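
For reference, the exclude-list check in the diff above pads both the list
and the node name with spaces, so that a name only matches as a whole
space-separated token. A minimal standalone sketch of that idea follows. The
helper name IsNodeExcluded and its signature are hypothetical and are not part
of the monitor code, which reads the list through getenv( TRAF_EXCLUDE_LIST )
inline.

```cpp
#include <string>

// Hypothetical helper, for illustration only (not in cmsh.cxx).
// Returns true when nodeName appears as a whole, space-separated token
// in excludeList. A null list or an empty name never matches.
bool IsNodeExcluded( const char *excludeList, const std::string &nodeName )
{
    if (excludeList == nullptr || nodeName.empty())
    {
        return false;
    }

    // Pad both sides with a space so that "n1" does not match inside "n10".
    const std::string padded = " " + std::string( excludeList ) + " ";
    const std::string token  = " " + nodeName + " ";

    return padded.find( token ) != std::string::npos;
}
```

With this sketch, IsNodeExcluded( "n1 n2 n3", "n2" ) is true, while
IsNodeExcluded( "n10 n20", "n1" ) is false. As in the diff, only single spaces
are treated as separators, so a list delimited by tabs or commas would not
match.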


> Multiple node failures occur during HA testing
> ----------------------------------------------
>
>                 Key: TRAFODION-2881
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2881
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.3
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>             Fix For: 2.3
>
>
> Inflicting server failure in certain modes causes multiple monitor
> processes to bring their own nodes down along with the intended target of
> the test.
> Server down modes:
> - init 6
> - reboot -f
> - shutdown -r now
> - shell node down command
> In addition, after a server down, the shell 'node up' command also fails
> intermittently. Reproducing this requires a longevity HA test that downs and
> ups nodes over a long period, such as 24-48 hours.

[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing

2018-01-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324353#comment-16324353
 ] 

ASF GitHub Bot commented on TRAFODION-2881:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1392#discussion_r161294809
  

Yes, it is! Good catch!



[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing

2018-01-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324252#comment-16324252
 ] 

ASF GitHub Bot commented on TRAFODION-2881:
---

Github user trinakrug commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1392#discussion_r161276162
  

Is TRAF_EXCLUDE_LIST obsolete?



[jira] [Commented] (TRAFODION-2881) Multiple node failures occur during HA testing

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16322606#comment-16322606
 ] 

ASF GitHub Bot commented on TRAFODION-2881:
---

GitHub user zcorrea opened a pull request:

https://github.com/apache/trafodion/pull/1392

[TRAFODION-2881] HA fixes

Fixed multiple problems in monitor Allgather() socket reconnect logic.
- Separated node down detection logic from communication errors and timeouts
  to better handle multiple failure scenarios
- Better handling network resets
- Additional trace information
- Fixed 'node up' hang in monitor shell due to TmSync race condition
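
To illustrate the first item above, here is a rough sketch of what keeping
node down detection separate from communication errors and timeouts can look
like. It is not taken from the monitor source: PeerStatus, PeerHealth and
Classify are hypothetical names, and the sketch assumes that node state comes
from an independent check rather than from the socket error itself.

```cpp
#include <cerrno>

// Hypothetical outcome of one exchange with a peer (illustrative names only).
enum class PeerStatus { Ok, Timeout, CommError, NodeDown };

// Hypothetical per-peer bookkeeping.
struct PeerHealth
{
    int consecutiveFailures = 0;
};

// Classify one exchange attempt.
//   err              - errno-style value, 0 on success
//   nodeReportedDown - result of an independent node state check (assumption)
// A socket error or timeout alone never yields NodeDown in this sketch.
PeerStatus Classify( int err, bool nodeReportedDown, PeerHealth &health )
{
    if (err == 0)
    {
        health.consecutiveFailures = 0;
        return PeerStatus::Ok;
    }

    ++health.consecutiveFailures;

    if (nodeReportedDown)
    {
        return PeerStatus::NodeDown;
    }
    if (err == ETIMEDOUT || err == EAGAIN)
    {
        return PeerStatus::Timeout;
    }
    return PeerStatus::CommError;
}
```

In this sketch a caller would retry or reconnect on Timeout and CommError, and
only treat the peer as failed on NodeDown.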

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zcorrea/trafodion TRAFODION-2881

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/trafodion/pull/1392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1392


commit e832d827507521998567d4cc5d92e4239007d19a
Author: Zalo Correa 
Date:   2018-01-11T17:32:11Z

[TRAFODION-2881] HA fixes
Fixed multiple problems in monitor Allgather() socket reconnect logic.
- Separated node down detection logic from communication errors and timeouts
  to better handle multiple failure scenarios
- Better handling of network resets
- Additional trace information
- Fixed 'node up' hang in monitor shell due to TmSync race condition




> Multiple node failures occur during HA testing
> ----------------------------------------------
>
>                 Key: TRAFODION-2881
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2881
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.3
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>             Fix For: 2.3
>
>
> Inflicting server failure in certain modes causes multiple monitor
> processes to bring their own nodes down along with the intended target of
> the test.
> Server down modes:
> - init 6
> - reboot -f
> - shutdown -r now
> - shell node down command
> In addition, after a server down, the shell 'node up' command also fails
> intermittently. Reproducing this requires a longevity HA test that downs and
> ups nodes over a long period, such as 24-48 hours.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)