[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382641#comment-16382641
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user asfgit closed the pull request at:

https://github.com/apache/trafodion/pull/1457


> Preliminary Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2883
> URL: https://issues.apache.org/jira/browse/TRAFODION-2883
> Project: Apache Trafodion
> Issue Type: Improvement
> Components: dtm, foundation, installer
> Affects Versions: 2.3
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
> Priority: Major
> Fix For: 2.3
>
>
> Initial changes required for:
>   - AGENT mode monitor
>       o Preliminary change to remove the dependency on OpenMPI during initialization of an operational cluster, by creating a cluster of one node (MASTER monitor) that other remote nodes (SLAVE monitors) join through the MASTER
>   - MASTER monitor selection
>   - Scale bug fixes found when creating clusters larger than 120 nodes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381254#comment-16381254
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171424414
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -1484,6 +1676,53 @@ void CZClient::WatchCluster( void )
 TRACE_EXIT;
 }
 
+int CZClient::WatchMasterNode( const char *nodeName )
+{
+const char method_name[] = "CZClient::WatchMasterNode";
+TRACE_ENTRY;
+
--- End diff --

Not needed in this context, as a null or empty nodeName here would be a 
bonehead programmer error with painful consequences.




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381248#comment-16381248
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171423535
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -1524,6 +1763,108 @@ int CZClient::WatchNode( const char *nodeName )
 return(rc);
 }
 
+int CZClient::WatchNodeMasterDelete( const char *nodeName )
+{
+const char method_name[] = "CZClient::WatchMasterDelete";
+TRACE_ENTRY;
+
+int rc = -1;
+stringstream newpath;
+newpath.str( "" );
+newpath << zkRootNode_.c_str() 
+<< zkRootNodeInstance_.c_str() 
+<< ZCLIENT_MASTER_ZNODE
+<< nodeName;
+   
+string monZnode = newpath.str( );
+
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d zoo_delete(%s)\n"
+, method_name, __LINE__
+, monZnode.c_str() );
+}
+   
+rc = zoo_delete( ZHandle
+   , monZnode.c_str( )
+   , -1 );
+if ( rc == ZOK )
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
+, method_name, __LINE__
+, nodeName );
+}
+char buf[MON_STRING_BUF_SIZE];
+snprintf( buf, sizeof(buf)
+, "[%s], znode (%s) deleted!\n"
+, method_name, nodeName );
+mon_log_write(MON_ZCLIENT_WATCHMASTERNODEDELETE_1, SQ_LOG_INFO, buf);
+}
+else if ( rc == ZNONODE )
+{
+// This is fine since we call it indiscriminately
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZNONODE (fine)\n"
+, method_name, __LINE__
+, nodeName );
+}
+}
+else if ( rc == ZCONNECTIONLOSS || 
+  rc == ZOPERATIONTIMEOUT )
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
+, method_name, __LINE__
+, nodeName );
+}
+rc = ZOK;
+char buf[MON_STRING_BUF_SIZE];
+snprintf( buf, sizeof(buf)
+, "[%s], znode (%s) already deleted or cannot be accessed!\n"
+, method_name, nodeName );
+mon_log_write(MON_ZCLIENT_WATCHMASTERNODEDELETE_2, SQ_LOG_INFO, buf);
+}
+else
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
--- End diff --

Yes, better to report actual error.
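
As a rough illustration of "report actual error": the connection-loss and final else branches of the quoted diff still print "with rc == ZOK", although rc holds something else there. A minimal sketch of building the trace text from the real return code is shown below. The enum and SketchErrorText() are hypothetical stand-ins so the example compiles on its own; in the monitor code the values come from the ZooKeeper C client, and zerror(rc) is what the WaitForAndReturnMaster diff in this thread uses for the text.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-ins for the return codes named in the diff. The real
// constants (ZOK, ZNONODE, ...) are defined by the ZooKeeper C client header.
enum SketchRc { SK_OK = 0, SK_NONODE, SK_CONNECTIONLOSS, SK_OPERATIONTIMEOUT, SK_OTHER };

// Hypothetical stand-in for zerror(rc): maps a return code to readable text.
const char *SketchErrorText( int rc )
{
    switch ( rc )
    {
        case SK_OK:               return "ok";
        case SK_NONODE:           return "no node";
        case SK_CONNECTIONLOSS:   return "connection loss";
        case SK_OPERATIONTIMEOUT: return "operation timeout";
        default:                  return "unknown error";
    }
}

// Builds the trace text from the actual return code instead of a fixed
// "rc == ZOK" literal, so each branch reports what really happened.
std::string DeleteTraceText( const char *nodeName, int rc )
{
    assert( nodeName != nullptr );
    char buf[256];
    std::snprintf( buf, sizeof(buf)
                 , "zoo_delete(%s) returned rc (%d): %s"
                 , nodeName, rc, SketchErrorText( rc ) );
    return std::string( buf );
}
```

With one formatter shared by all branches, a copy-pasted message can no longer claim ZOK for a failed call.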




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381247#comment-16381247
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171423349
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -1524,6 +1763,108 @@ int CZClient::WatchNode( const char *nodeName )
 return(rc);
 }
 
+int CZClient::WatchNodeMasterDelete( const char *nodeName )
+{
+const char method_name[] = "CZClient::WatchMasterDelete";
+TRACE_ENTRY;
+
+int rc = -1;
+stringstream newpath;
+newpath.str( "" );
+newpath << zkRootNode_.c_str() 
+<< zkRootNodeInstance_.c_str() 
+<< ZCLIENT_MASTER_ZNODE
+<< nodeName;
+   
+string monZnode = newpath.str( );
+
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d zoo_delete(%s)\n"
+, method_name, __LINE__
+, monZnode.c_str() );
+}
+   
+rc = zoo_delete( ZHandle
+   , monZnode.c_str( )
+   , -1 );
+if ( rc == ZOK )
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
+, method_name, __LINE__
+, nodeName );
+}
+char buf[MON_STRING_BUF_SIZE];
+snprintf( buf, sizeof(buf)
+, "[%s], znode (%s) deleted!\n"
+, method_name, nodeName );
+mon_log_write(MON_ZCLIENT_WATCHMASTERNODEDELETE_1, SQ_LOG_INFO, buf);
+}
+else if ( rc == ZNONODE )
+{
+// This is fine since we call it indiscriminately
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZNONODE (fine)\n"
+, method_name, __LINE__
+, nodeName );
+}
+}
+else if ( rc == ZCONNECTIONLOSS || 
+  rc == ZOPERATIONTIMEOUT )
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
--- End diff --

Yes, better to report actual error.




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381226#comment-16381226
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171420418
  
--- Diff: core/sqf/src/trafconf/clusterconf.cpp ---
@@ -373,6 +376,13 @@ bool CClusterConfig::LoadNodeConfig( void )
 for (int i =0; i < nodeCount; i++ )
 {
 ProcessLNode( nodeConfigData[i], pnodeConfigInfo, lnodeConfigInfo );
+// We want to pick the first configured node so all monitors pick the same one
+// This only comes into play for a Trafodion start from scratch
+if (i == 0)
+{
+configMaster_ = pnodeConfigInfo.pnid;
+strcpy (configMasterName_ ,pnodeConfigInfo.nodename);
--- End diff --

Doesn't hurt!




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381221#comment-16381221
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171419726
  
--- Diff: core/sqf/monitor/linux/mlio.cxx ---
@@ -1261,7 +1261,13 @@ SQ_LocalIOToClient::SQ_LocalIOToClient(int nid)
   if (cmid == -1)
   {
   if (trace_settings & TRACE_INIT)
- trace_printf("%s@%d" " failed shmget("  "%d" "), errno="  "%d" "\n", method_name, __LINE__, (shsize), errno);
+  {
+  int err = errno;
+  char la_buf[MON_STRING_BUF_SIZE];
--- End diff --

yep




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381205#comment-16381205
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user narendragoyal commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171351676
  
--- Diff: core/sqf/monitor/linux/mlio.cxx ---
@@ -1261,7 +1261,13 @@ SQ_LocalIOToClient::SQ_LocalIOToClient(int nid)
   if (cmid == -1)
   {
   if (trace_settings & TRACE_INIT)
- trace_printf("%s@%d" " failed shmget("  "%d" "), errno="  "%d" "\n", method_name, __LINE__, (shsize), errno);
+  {
+  int err = errno;
+  char la_buf[MON_STRING_BUF_SIZE];
--- End diff --

la_buf is not used in this block.




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381209#comment-16381209
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user narendragoyal commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171353312
  
--- Diff: core/sqf/src/trafconf/clusterconf.cpp ---
@@ -373,6 +376,13 @@ bool CClusterConfig::LoadNodeConfig( void )
 for (int i =0; i < nodeCount; i++ )
 {
 ProcessLNode( nodeConfigData[i], pnodeConfigInfo, lnodeConfigInfo );
+// We want to pick the first configured node so all monitors pick the same one
+// This only comes into play for a Trafodion start from scratch
+if (i == 0)
+{
+configMaster_ = pnodeConfigInfo.pnid;
+strcpy (configMasterName_ ,pnodeConfigInfo.nodename);
--- End diff --

If necessary, we could limit the copy to the length of the destination buffer.
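
One way such a bounded copy could look is sketched below. CopyNodeName() is a hypothetical helper, not something in the pull request; it uses snprintf so the destination is always NUL-terminated and the caller can tell whether the name was truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: copies src into dst, writing at most dstSize bytes
// including the terminating NUL. Returns true if the whole string fit and
// false if it had to be truncated.
bool CopyNodeName( char *dst, std::size_t dstSize, const char *src )
{
    assert( dst != nullptr && src != nullptr && dstSize > 0 );
    // snprintf returns the length the full string would have needed.
    int needed = std::snprintf( dst, dstSize, "%s", src );
    return needed >= 0 && static_cast<std::size_t>( needed ) < dstSize;
}
```

In the quoted hunk the call would take the place of the strcpy, with sizeof(configMasterName_) as the size, assuming configMasterName_ is declared as a fixed-size char array (the diff does not show its declaration).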




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381211#comment-16381211
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user narendragoyal commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171365789
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -1484,6 +1676,53 @@ void CZClient::WatchCluster( void )
 TRACE_EXIT;
 }
 
+int CZClient::WatchMasterNode( const char *nodeName )
+{
+const char method_name[] = "CZClient::WatchMasterNode";
+TRACE_ENTRY;
+
--- End diff --

Check for an empty/null nodeName just in case? :)
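
For reference, the kind of guard being suggested might look like the sketch below. HasNodeName() and WatchIfNamed() are hypothetical names used only for illustration; the -1 result mirrors the "int rc = -1" initial value seen in the quoted zclient.cxx methods and is an assumption about what a rejected call should return.

```cpp
#include <cassert>

// Returns true only when nodeName points at a non-empty C string.
bool HasNodeName( const char *nodeName )
{
    return nodeName != nullptr && nodeName[0] != '\0';
}

// Hypothetical wrapper showing where the guard could sit: the watch callback
// is only invoked for a usable name, otherwise -1 is returned immediately.
int WatchIfNamed( const char *nodeName, int (*watch)( const char * ) )
{
    assert( watch != nullptr );
    if ( !HasNodeName( nodeName ) )
    {
        return -1;
    }
    return watch( nodeName );
}
```

Whether to return an error or to assert is a design choice: a guard hides the caller's mistake at run time, while an assert surfaces it during development.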




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381206#comment-16381206
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user narendragoyal commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171387433
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -488,6 +488,103 @@ int CZClient::ZooExistRetry(zhandle_t *zh, const char *path, int watch, struct S
 return rc;
 }
 
+const char* CZClient::WaitForAndReturnMaster( bool doWait )
+{
+const char method_name[] = "CZClient::WaitForAndReturnMaster";
+TRACE_ENTRY;
+
+bool found = false;
+int rc = -1;
+int retries = 0;
+Stat stat;
+
+struct String_vector nodes = {0, NULL};
+stringstream ss;
+ss.str( "" );
+ss << zkRootNode_.c_str() 
+   << zkRootNodeInstance_.c_str() 
+   << ZCLIENT_MASTER_ZNODE;
+string masterMonitor( ss.str( ) );
+
+// wait for 3 minutes for giving up.  
+while ( (!found) && (retries < 180)) 
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d trafCluster=%s\n"
+, method_name, __LINE__, masterMonitor.c_str() );
+}
+// Verify the existence of the parent ZCLIENT_MASTER_ZNODE
+rc = ZooExistRetry( ZHandle, masterMonitor.c_str( ), 0, &stat );
+
+if ( rc == ZNONODE )
+{
+if (doWait == false)
+{
+break;
+} 
+continue;
+}
+else if ( rc == ZOK )
+{
+// Now get the list of available znodes in the cluster.
+//
+// This will return child znodes for each monitor process that has
+// registered, including this process.
+rc = zoo_get_children( ZHandle, masterMonitor.c_str( ), 0, &nodes );
+if ( nodes.count > 0 )
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d nodes.count=%d\n"
+, method_name, __LINE__
+, nodes.count );
+}
+found = true;
+}
+else
+{
+if (doWait == false)
+{
+break;
+}
+usleep(100); // sleep for a second as to not overwhelm the system
+   retries++;
+continue;
+}
+}
+ 
+else  // error
+{ 
+   if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d Error (MasterMonitor) WaitForAndReturnMaster returned rc (%d), retries %d\n"
--- End diff --

I think we don't need the 'WaitForAndReturnMaster' in the string - the 
method_name being printed already has it
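
A side note on the retry loop in the quoted hunk: its comments describe a wait of 3 minutes over 180 retries with a one-second sleep, but usleep() takes microseconds, so usleep(100) pauses for 100 microseconds. The ZNONODE branch also continues without sleeping or incrementing retries. The sketch below shows one possible shape for a bounded wait in which every unsuccessful pass sleeps and counts. WaitUntilReady() is a hypothetical helper, not code from the pull request.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: polls ready() until it returns true or until
// maxRetries sleeps have elapsed. Every unsuccessful pass both sleeps and
// counts, so the loop cannot spin without making progress toward the limit.
template <typename Ready>
bool WaitUntilReady( Ready ready, int maxRetries, std::chrono::milliseconds interval )
{
    assert( maxRetries >= 0 );
    for ( int retries = 0; ; ++retries )
    {
        if ( ready() )
        {
            return true;
        }
        if ( retries >= maxRetries )
        {
            return false;
        }
        std::this_thread::sleep_for( interval );
    }
}
```

A call such as WaitUntilReady( check, 180, std::chrono::seconds( 1 ) ) would match the "3 minutes" stated in the comment, where check is whatever predicate wraps the znode lookup.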




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381207#comment-16381207
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user narendragoyal commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171365500
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -1524,6 +1763,108 @@ int CZClient::WatchNode( const char *nodeName )
 return(rc);
 }
 
+int CZClient::WatchNodeMasterDelete( const char *nodeName )
+{
+const char method_name[] = "CZClient::WatchMasterDelete";
+TRACE_ENTRY;
+
--- End diff --

Not sure if we need a check for empty/null nodeName here.




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381210#comment-16381210
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user narendragoyal commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171387644
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -488,6 +488,103 @@ int CZClient::ZooExistRetry(zhandle_t *zh, const char *path, int watch, struct S
 return rc;
 }
 
+const char* CZClient::WaitForAndReturnMaster( bool doWait )
+{
+const char method_name[] = "CZClient::WaitForAndReturnMaster";
+TRACE_ENTRY;
+
+bool found = false;
+int rc = -1;
+int retries = 0;
+Stat stat;
+
+struct String_vector nodes = {0, NULL};
+stringstream ss;
+ss.str( "" );
+ss << zkRootNode_.c_str() 
+   << zkRootNodeInstance_.c_str() 
+   << ZCLIENT_MASTER_ZNODE;
+string masterMonitor( ss.str( ) );
+
+// wait for 3 minutes for giving up.  
+while ( (!found) && (retries < 180)) 
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d trafCluster=%s\n"
+, method_name, __LINE__, masterMonitor.c_str() );
+}
+// Verify the existence of the parent ZCLIENT_MASTER_ZNODE
+rc = ZooExistRetry( ZHandle, masterMonitor.c_str( ), 0, &stat );
+
+if ( rc == ZNONODE )
+{
+if (doWait == false)
+{
+break;
+} 
+continue;
+}
+else if ( rc == ZOK )
+{
+// Now get the list of available znodes in the cluster.
+//
+// This will return child znodes for each monitor process that has
+// registered, including this process.
+rc = zoo_get_children( ZHandle, masterMonitor.c_str( ), 0, &nodes );
+if ( nodes.count > 0 )
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d nodes.count=%d\n"
+, method_name, __LINE__
+, nodes.count );
+}
+found = true;
+}
+else
+{
+if (doWait == false)
+{
+break;
+}
+usleep(100); // sleep for a second as to not overwhelm the system
+   retries++;
+continue;
+}
+}
+ 
+else  // error
+{ 
+   if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d Error (MasterMonitor) WaitForAndReturnMaster returned rc (%d), retries %d\n"
+, method_name, __LINE__, rc, retries );
+}
+char buf[MON_STRING_BUF_SIZE];
+snprintf( buf, sizeof(buf)
+, "[%s], ZooExistRetry() for %s failed with error %s\n"
+,  method_name, masterMonitor.c_str( ), zerror(rc));
+mon_log_write(MON_ZCLIENT_WAITFORANDRETURNMASTER, SQ_LOG_ERR, buf);
+break;
+}
+}
+ 
+//should we assert nodes.count == 1?
+if (found)
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+{
+trace_printf( "%s@%d (MasterMonitor) Master Monitor found (%s)\n"
+, method_name, __LINE__, masterMonitor.c_str() );
+}
+return nodes.data[0];
--- End diff --

TRACE_EXIT here?
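
Early returns like the one flagged here are easy to miss when the exit trace is a statement that must be repeated before every return. One alternative, sketched below under the assumption that the exit trace only needs the method name, is a small scope guard whose destructor emits the trace on every path. ScopedTrace and FirstOrNull() are hypothetical and not part of the monitor code; what the TRACE_ENTRY and TRACE_EXIT macros actually expand to is not shown in this thread.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical scope guard: prints an entry trace when constructed and an
// exit trace when it goes out of scope, so every return path is covered.
class ScopedTrace
{
public:
    explicit ScopedTrace( const char *method ) : method_( method )
    {
        assert( method_ != nullptr );
        std::printf( "%s enter\n", method_ );
    }
    ~ScopedTrace()
    {
        std::printf( "%s exit\n", method_ );
        ++exitCount;
    }
    static int exitCount;   // number of exits traced so far, for illustration
private:
    ScopedTrace( const ScopedTrace & );              // non-copyable
    ScopedTrace &operator=( const ScopedTrace & );
    const char *method_;
};
int ScopedTrace::exitCount = 0;

// Two return paths, one guard: both paths trace the exit.
const char *FirstOrNull( const char **names, int count )
{
    ScopedTrace trace( "FirstOrNull" );
    if ( count > 0 )
    {
        return names[0];   // early return, the destructor still runs
    }
    return nullptr;
}
```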



[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381136#comment-16381136
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171408186
  
--- Diff: core/sqf/monitor/linux/cluster.cxx ---
@@ -352,6 +358,131 @@ void CCluster::NodeReady( CNode *spareNode )
 TRACE_EXIT;
 }
 
+// Assign leaders as required
+// Current leaders are TM Leader and Monitor Leader
+void CCluster::AssignLeaders( int pnid, bool checkProcess )
+{
+const char method_name[] = "CCluster::AssignLeaders";
+TRACE_ENTRY;
+
+AssignTmLeader ( pnid, checkProcess );
+AssignMonitorLeader ( pnid );
+
+TRACE_EXIT;
+}
+
+// Assign monitor lead in the case of failure
+void CCluster::AssignMonitorLeader( int pnid )
+{
+const char method_name[] = "CCluster::AssignMonitorLeader";
+TRACE_ENTRY;
+ 
+int i = 0;
+int rc = 0;
+
+int lMonitorLeaderPNid = MonitorLeaderPNid;
+CNode *node = NULL;
+
+if (MonitorLeaderPNid != pnid) 
+{
+if (trace_settings & (TRACE_INIT | TRACE_RECOVERY | TRACE_REQUEST | TRACE_SYNC | TRACE_TMSYNC))
+{
+trace_printf( "%s@%d" " - (MasterMonitor) returning, pnid %d != monitorLead %d\n"
+, method_name, __LINE__, pnid, MonitorLeaderPNid );
+}
+ return;
--- End diff --

Yes, good catch!




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381135#comment-16381135
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171408158
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -799,6 +896,67 @@ bool CZClient::IsZNodeExpired( const char *nodeName, int  )
 return( expired );
 }
 
+int CZClient::CreateMasterZNode(  const char *nodeName )
+{
+const char method_name[] = "CZClient::CreateMasterZNode";
+TRACE_ENTRY;
+
+int rc;
+int retries = 0;
+
+stringstream masterpath;
+masterpath.str( "" );
+masterpath << zkRootNode_.c_str() 
+<< zkRootNodeInstance_.c_str() 
+<< ZCLIENT_MASTER_ZNODE<< "/"
+<< nodeName;
+
+string monZnode = masterpath.str( );
+
+stringstream ss;
+ss.str( "" );
+ss <


[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381134#comment-16381134
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171408008
  
--- Diff: core/sqf/monitor/linux/cluster.h ---
@@ -229,6 +235,7 @@ class CCluster
 int configPNodesMax_;   // max # of physical nodes that can be configured
 int*NodeMap;// Mapping of Node ranks to COMM_WORLD ranks
 int TmLeaderNid;// Nid of currently assigned TM Leader node
+int MonitorLeaderPNid; // PNid of currently assigned Monitor leader node
--- End diff --

Yes, we do. I noted these anomalies in the code this morning and will change 
this to follow the established convention. Thanks for pointing this out.




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380732#comment-16380732
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171323042
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -488,6 +488,103 @@ int CZClient::ZooExistRetry(zhandle_t *zh, const char *path, int watch, struct S
     return rc;
 }
 
+const char* CZClient::WaitForAndReturnMaster( bool doWait )
+{
+    const char method_name[] = "CZClient::WaitForAndReturnMaster";
+    TRACE_ENTRY;
+
+    bool found = false;
+    int rc = -1;
+    int retries = 0;
+    Stat stat;
+
+    struct String_vector nodes = {0, NULL};
+    stringstream ss;
+    ss.str( "" );
+    ss << zkRootNode_.c_str()
+       << zkRootNodeInstance_.c_str()
+       << ZCLIENT_MASTER_ZNODE;
+    string masterMonitor( ss.str( ) );
+
+    // wait for 3 minutes before giving up.
+    while ( (!found) && (retries < 180))
+    {
+        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+        {
+            trace_printf( "%s@%d trafCluster=%s\n"
+                        , method_name, __LINE__, masterMonitor.c_str() );
+        }
+        // Verify the existence of the parent ZCLIENT_MASTER_ZNODE
+        rc = ZooExistRetry( ZHandle, masterMonitor.c_str( ), 0,  );
+
+        if ( rc == ZNONODE )
+        {
+            if (doWait == false)
+            {
+                break;
+            }
+            continue;
--- End diff --

Should we sleep in this path? Otherwise we seem to be spinning on the 
ZooKeeper check until the znode appears.
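
A minimal sketch of the pattern this comment asks for: a bounded retry loop that sleeps between attempts instead of busy-waiting. `WaitUntil` and its parameters are hypothetical names used only for illustration; they are not part of the monitor code or of this pull request.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of the Trafodion monitor: polls `check`
// until it succeeds or `maxRetries` attempts have been made, sleeping
// `delay` after each failed attempt so the loop does not spin.
bool WaitUntil( const std::function<bool()> &check
              , int maxRetries
              , std::chrono::milliseconds delay )
{
    assert( maxRetries >= 0 );

    for (int retries = 0; retries < maxRetries; ++retries)
    {
        if (check())
        {
            return true;
        }
        // Back off before the next attempt; without this the loop is a busy wait.
        std::this_thread::sleep_for( delay );
    }
    return false;
}
```

With `maxRetries` set to 180 and a one-second delay, such a loop would give up after roughly three minutes, which matches the intent stated in the comment inside the diff.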




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380731#comment-16380731
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171324209
  
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -799,6 +896,67 @@ bool CZClient::IsZNodeExpired( const char *nodeName, int  )
     return( expired );
 }
 
+int CZClient::CreateMasterZNode(  const char *nodeName )
+{
+    const char method_name[] = "CZClient::CreateMasterZNode";
+    TRACE_ENTRY;
+
+    int rc;
+    int retries = 0;
+
+    stringstream masterpath;
+    masterpath.str( "" );
+    masterpath << zkRootNode_.c_str()
+               << zkRootNodeInstance_.c_str()
+               << ZCLIENT_MASTER_ZNODE<< "/"
+               << nodeName;
+
+    string monZnode = masterpath.str( );
+
+    stringstream ss;
+    ss.str( "" );
+    ss <


[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380733#comment-16380733
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171325354
  
--- Diff: core/sqf/monitor/linux/cluster.cxx ---
@@ -352,6 +358,131 @@ void CCluster::NodeReady( CNode *spareNode )
     TRACE_EXIT;
 }
 
+// Assign leaders as required
+// Current leaders are TM Leader and Monitor Leader
+void CCluster::AssignLeaders( int pnid, bool checkProcess )
+{
+    const char method_name[] = "CCluster::AssignLeaders";
+    TRACE_ENTRY;
+
+    AssignTmLeader ( pnid, checkProcess );
+    AssignMonitorLeader ( pnid );
+
+    TRACE_EXIT;
+}
+
+// Assign monitor lead in the case of failure
+void CCluster::AssignMonitorLeader( int pnid )
+{
+    const char method_name[] = "CCluster::AssignMonitorLeader";
+    TRACE_ENTRY;
+
+    int i = 0;
+    int rc = 0;
+
+    int lMonitorLeaderPNid = MonitorLeaderPNid;
+    CNode *node = NULL;
+
+    if (MonitorLeaderPNid != pnid)
+    {
+        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY | TRACE_REQUEST | TRACE_SYNC | TRACE_TMSYNC))
+        {
+            trace_printf( "%s@%d" " - (MasterMonitor) returning, pnid %d != monitorLead %d\n"
+                        , method_name, __LINE__, pnid, MonitorLeaderPNid );
+        }
+        return;
--- End diff --

Should there be a TRACE_EXIT before this return?
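
One way to make this class of omission impossible is a scope guard whose destructor writes the exit record, so every return path is covered. The sketch below is an illustration only: `ScopedTrace` and `AssignIfLeader` are hypothetical names, and a vector of strings stands in for the monitor's trace output, since the definitions of TRACE_ENTRY and TRACE_EXIT are not shown in this thread.

```cpp
#include <string>
#include <vector>

// Hypothetical scope guard: logs on construction and again on destruction,
// so the exit record is written on every return path.
class ScopedTrace
{
public:
    ScopedTrace( std::vector<std::string> &log, const std::string &method )
        : log_( log ), method_( method )
    {
        log_.push_back( "enter " + method_ );
    }

    ~ScopedTrace()
    {
        log_.push_back( "exit " + method_ );
    }

    ScopedTrace( const ScopedTrace & ) = delete;
    ScopedTrace &operator=( const ScopedTrace & ) = delete;

private:
    std::vector<std::string> &log_;   // stand-in for the trace file
    std::string method_;
};

// Hypothetical function with the same shape as the early return above.
void AssignIfLeader( std::vector<std::string> &log, int leaderPNid, int pnid )
{
    ScopedTrace trace( log, "AssignIfLeader" );

    if (leaderPNid != pnid)
    {
        return;   // the destructor still logs the exit here
    }
    log.push_back( "assigning new leader" );
}
```

The trade-off is that the exit record is written when the guard goes out of scope, not at the exact line where a macro would have been placed.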




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380734#comment-16380734
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1457#discussion_r171320342
  
--- Diff: core/sqf/monitor/linux/cluster.h ---
@@ -229,6 +235,7 @@ class CCluster
     int configPNodesMax_;   // max # of physical nodes that can be configured
     int *NodeMap;           // Mapping of Node ranks to COMM_WORLD ranks
     int TmLeaderNid;        // Nid of currently assigned TM Leader node
+    int MonitorLeaderPNid;  // PNid of currently assigned Monitor leader node
--- End diff --

In much of the SQL code, the convention is that member names end in an 
underscore. Do we follow that convention in the monitor?




[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-27 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379654#comment-16379654
 ] 

Gonzalo E Correa commented on TRAFODION-2883:
-

I have created the pull request which will need a code review before it can be 
merged.

[https://github.com/apache/trafodion/pull/1457]

These changes include the ability to run the monitor processes in AGENT mode 
from a Python installation, plus several other scale-related changes and bug 
fixes.

To enable AGENT mode, uncomment the following environment variables in 
sqenvcom.sh and copy the file to all nodes.
{panel:title=sqenvcom.sh}
# Monitor process creator:
#   MPIRUN - monitor process is created by mpirun
# Uncomment SQ_MON_CREATOR when running monitor in AGENT mode
#export SQ_MON_CREATOR=MPIRUN

# Monitor process run mode:
#   AGENT - monitor process runs in agent mode versus MPI collective
# Uncomment the three environment variables below
#export SQ_MON_RUN_MODE=AGENT
#export MONITOR_COMM_PORT=23399
#export MONITOR_SYNC_PORT=23398
{panel}
An alternative to the above is to add the following to sql/scripts/shell.env:

SQ_MON_CREATOR=MPIRUN
SQ_MON_RUN_MODE=AGENT
MONITOR_COMM_PORT=23399
MONITOR_SYNC_PORT=23398
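
As an illustration of how a process might consume these settings, the sketch below reads the four variables from the environment and validates the ports. The variable names come from the comment above; the `MonitorEnv` structure, the helper functions, and the rule that AGENT mode needs both ports are assumptions made for this example, not the monitor's actual startup code.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical holder for the settings named above.
struct MonitorEnv
{
    std::string creator;    // SQ_MON_CREATOR
    std::string runMode;    // SQ_MON_RUN_MODE
    int         commPort;   // MONITOR_COMM_PORT, -1 if unset or invalid
    int         syncPort;   // MONITOR_SYNC_PORT, -1 if unset or invalid
};

// Returns the variable's value, or `fallback` if it is unset or empty.
std::string EnvOr( const char *name, const std::string &fallback )
{
    const char *value = std::getenv( name );
    return (value != nullptr && *value != '\0') ? std::string( value ) : fallback;
}

// Returns the variable parsed as a TCP port, or `fallback` if it is unset,
// not a number, or outside 1..65535.
int EnvPortOr( const char *name, int fallback )
{
    const char *value = std::getenv( name );
    if (value == nullptr)
    {
        return fallback;
    }
    char *end = nullptr;
    long port = std::strtol( value, &end, 10 );
    if (end == value || *end != '\0' || port < 1 || port > 65535)
    {
        return fallback;
    }
    return static_cast<int>( port );
}

MonitorEnv ReadMonitorEnv()
{
    MonitorEnv env;
    env.creator  = EnvOr( "SQ_MON_CREATOR", "" );
    env.runMode  = EnvOr( "SQ_MON_RUN_MODE", "" );
    env.commPort = EnvPortOr( "MONITOR_COMM_PORT", -1 );
    env.syncPort = EnvPortOr( "MONITOR_SYNC_PORT", -1 );
    return env;
}

// Assumption for this sketch: AGENT mode is usable only with both ports set.
bool IsAgentMode( const MonitorEnv &env )
{
    return env.runMode == "AGENT" && env.commPort > 0 && env.syncPort > 0;
}
```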

To enable monitor tracing in AGENT mode, edit sql/scripts/monitor.env and 
uncomment the desired trace level.

Once this is merged to the baseline, I will merge these changes up to the 
shared TRAFODION-2884 branch in the zcorrea_fork.



[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379632#comment-16379632
 ] 

ASF GitHub Bot commented on TRAFODION-2883:
---

GitHub user zcorrea opened a pull request:

https://github.com/apache/trafodion/pull/1457

[TRAFODION-2883] Preliminary Scale Enhancements

- Added timestamps to node down system message
- Added timestamps and values to registry change notifications
- Fixed monitor trace causing memory overwrites
- Implemented AGENT mode monitor functionality
  o This is a preliminary change to remove dependency on OpenMPI during
    initialization of operational cluster by creating a cluster of one
    node (MASTER monitor) where other remote nodes (SLAVE monitors) join
    the cluster through the MASTER
- Implemented MASTER monitor selection logic
- Scale bug fixes found when creating clusters greater than 120 nodes
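
The MASTER/SLAVE arrangement described in the list above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the selection logic in the pull request: `FakeZNodeStore`, `MonitorRole`, and `SelectRole` are hypothetical names, and the map only models the "create fails if the path already exists" behavior that a coordination service such as ZooKeeper provides.

```cpp
#include <map>
#include <string>

// In-memory stand-in for a coordination service. It models only one rule:
// creating a path that already exists fails.
class FakeZNodeStore
{
public:
    bool Create( const std::string &path, const std::string &data )
    {
        return nodes_.emplace( path, data ).second;
    }

    bool Get( const std::string &path, std::string &data ) const
    {
        auto it = nodes_.find( path );
        if (it == nodes_.end())
        {
            return false;
        }
        data = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> nodes_;
};

enum class MonitorRole { Master, Slave };

// Hypothetical rule: the first monitor to create the master entry becomes
// MASTER; every later monitor reads the master's name and joins as SLAVE.
MonitorRole SelectRole( FakeZNodeStore &store
                      , const std::string &masterPath
                      , const std::string &nodeName
                      , std::string &masterName )
{
    if (store.Create( masterPath, nodeName ))
    {
        masterName = nodeName;
        return MonitorRole::Master;
    }
    store.Get( masterPath, masterName );
    return MonitorRole::Slave;
}
```

In this model the cluster starts as a single MASTER, and each additional node learns the MASTER's name from the shared entry before joining, which avoids any collective startup step.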


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zcorrea/trafodion TRAFODION-2883

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/trafodion/pull/1457.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1457


commit 887051c8a1ba57b0ad47278dce2fc9361a2afb9c
Author: Zalo Correa 
Date:   2018-02-21T00:57:40Z

[TRAFODION-2883] Preliminary Scale Enhancements
- Increased cluster node limit to 1024
- Added timestamps to node down system message
- Added timestamps and values to registry change notifications
- Fixed monitor trace causing memory overwrites

commit bded0e843f8b600a5459c5353bbdc9f59d6d6551
Author: Zalo Correa 
Date:   2018-02-28T01:34:25Z

[TRAFODION-2883] Preliminary Scale Enhancements
- Added timestamps to node down system message
- Added timestamps and values to registry change notifications
- Fixed monitor trace causing memory overwrites
- Implemented AGENT mode monitor functionality
  o This is a preliminary change to remove dependency on OpenMPI during
    initialization of operational cluster by creating a cluster of one
    node (MASTER monitor) where other remote nodes (SLAVE monitors) join
    the cluster through the MASTER
- Implemented MASTER monitor selection logic
- Scale bug fixes found when creating clusters greater than 120 nodes



