[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382641#comment-16382641 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/trafodion/pull/1457

> Preliminary Trafodion Foundation Scalability Enhancements
> ---------------------------------------------------------
>
>                 Key: TRAFODION-2883
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2883
>             Project: Apache Trafodion
>          Issue Type: Improvement
>          Components: dtm, foundation, installer
>    Affects Versions: 2.3
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>            Priority: Major
>             Fix For: 2.3
>
>
> Initial changes required to:
>  - AGENT mode monitor
>    o Preliminary change to remove dependency on OpenMPI during
>      initialization of operational cluster by creating a cluster
>      of one node (MASTER monitor) where other remote nodes (SLAVE
>      monitors) join the cluster through the MASTER
>  - MASTER monitor selection
>  - Scale bug fixes found when creating clusters greater than 120 nodes

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381254#comment-16381254 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171424414

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -1484,6 +1676,53 @@ void CZClient::WatchCluster( void )
         TRACE_EXIT;
     }

    +int CZClient::WatchMasterNode( const char *nodeName )
    +{
    +    const char method_name[] = "CZClient::WatchMasterNode";
    +    TRACE_ENTRY;
    +
    --- End diff --

    Not needed in this context as it would be a programmer bonehead with painful consequences.
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381248#comment-16381248 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171423535

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -1524,6 +1763,108 @@ int CZClient::WatchNode( const char *nodeName )
         return(rc);
     }

    +int CZClient::WatchNodeMasterDelete( const char *nodeName )
    +{
    +    const char method_name[] = "CZClient::WatchMasterDelete";
    +    TRACE_ENTRY;
    +
    +    int rc = -1;
    +    stringstream newpath;
    +    newpath.str( "" );
    +    newpath << zkRootNode_.c_str()
    +            << zkRootNodeInstance_.c_str()
    +            << ZCLIENT_MASTER_ZNODE
    +            << nodeName;
    +
    +    string monZnode = newpath.str( );
    +
    +    if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +    {
    +        trace_printf( "%s@%d zoo_delete(%s)\n"
    +                    , method_name, __LINE__
    +                    , monZnode.c_str() );
    +    }
    +
    +    rc = zoo_delete( ZHandle
    +                   , monZnode.c_str( )
    +                   , -1 );
    +    if ( rc == ZOK )
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
    +                        , method_name, __LINE__
    +                        , nodeName );
    +        }
    +        char buf[MON_STRING_BUF_SIZE];
    +        snprintf( buf, sizeof(buf)
    +                , "[%s], znode (%s) deleted!\n"
    +                , method_name, nodeName );
    +        mon_log_write(MON_ZCLIENT_WATCHMASTERNODEDELETE_1, SQ_LOG_INFO, buf);
    +    }
    +    else if ( rc == ZNONODE )
    +    {
    +        // This is fine since we call it indiscriminately
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZNONODE (fine)\n"
    +                        , method_name, __LINE__
    +                        , nodeName );
    +        }
    +    }
    +    else if ( rc == ZCONNECTIONLOSS ||
    +              rc == ZOPERATIONTIMEOUT )
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
    +                        , method_name, __LINE__
    +                        , nodeName );
    +        }
    +        rc = ZOK;
    +        char buf[MON_STRING_BUF_SIZE];
    +        snprintf( buf, sizeof(buf)
    +                , "[%s], znode (%s) already deleted or cannot be accessed!\n"
    +                , method_name, nodeName );
    +        mon_log_write(MON_ZCLIENT_WATCHMASTERNODEDELETE_2, SQ_LOG_INFO, buf);
    +    }
    +    else
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
    --- End diff --

    Yes, better to report actual error.
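The point agreed above (report the actual return code rather than a hard-coded "rc == ZOK") could look roughly like the following. This is only a sketch: the `SK_*` constants and both `sketch_*` helpers are hypothetical stand-ins so that it builds without `zookeeper.h`; in the monitor the real constants and `zerror()` from the ZooKeeper C client would be used instead.

```cpp
#include <cstdio>
#include <string>

// Local stand-ins so the sketch builds without zookeeper.h (assumption:
// the monitor itself would use the ZooKeeper client's own constants).
enum SketchZkRc
{
    SK_ZOK               = 0,
    SK_ZCONNECTIONLOSS   = -4,
    SK_ZOPERATIONTIMEOUT = -7,
    SK_ZNONODE           = -101
};

// Hypothetical helper: map a return code to a printable name.
const char *sketch_rc_name( int rc )
{
    switch ( rc )
    {
        case SK_ZOK:               return "ZOK";
        case SK_ZNONODE:           return "ZNONODE";
        case SK_ZCONNECTIONLOSS:   return "ZCONNECTIONLOSS";
        case SK_ZOPERATIONTIMEOUT: return "ZOPERATIONTIMEOUT";
        default:                   return "UNKNOWN";
    }
}

// Hypothetical helper: build the trace text from the rc that was actually
// returned, so every branch logs what really happened.
std::string sketch_delete_trace( const char *methodName
                               , const char *nodeName
                               , int rc )
{
    char buf[256];
    snprintf( buf, sizeof(buf)
            , "%s (MasterMonitor) zoo_delete(%s) returned rc=%d (%s)"
            , methodName, nodeName, rc, sketch_rc_name( rc ) );
    return std::string( buf );
}
```

With one helper of this kind, the ZCONNECTIONLOSS/ZOPERATIONTIMEOUT branch and the final else branch would no longer need their own copy of the message text.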
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381247#comment-16381247 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171423349

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -1524,6 +1763,108 @@ int CZClient::WatchNode( const char *nodeName )
         return(rc);
     }

    +int CZClient::WatchNodeMasterDelete( const char *nodeName )
    +{
    +    const char method_name[] = "CZClient::WatchMasterDelete";
    +    TRACE_ENTRY;
    +
    +    int rc = -1;
    +    stringstream newpath;
    +    newpath.str( "" );
    +    newpath << zkRootNode_.c_str()
    +            << zkRootNodeInstance_.c_str()
    +            << ZCLIENT_MASTER_ZNODE
    +            << nodeName;
    +
    +    string monZnode = newpath.str( );
    +
    +    if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +    {
    +        trace_printf( "%s@%d zoo_delete(%s)\n"
    +                    , method_name, __LINE__
    +                    , monZnode.c_str() );
    +    }
    +
    +    rc = zoo_delete( ZHandle
    +                   , monZnode.c_str( )
    +                   , -1 );
    +    if ( rc == ZOK )
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
    +                        , method_name, __LINE__
    +                        , nodeName );
    +        }
    +        char buf[MON_STRING_BUF_SIZE];
    +        snprintf( buf, sizeof(buf)
    +                , "[%s], znode (%s) deleted!\n"
    +                , method_name, nodeName );
    +        mon_log_write(MON_ZCLIENT_WATCHMASTERNODEDELETE_1, SQ_LOG_INFO, buf);
    +    }
    +    else if ( rc == ZNONODE )
    +    {
    +        // This is fine since we call it indiscriminately
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZNONODE (fine)\n"
    +                        , method_name, __LINE__
    +                        , nodeName );
    +        }
    +    }
    +    else if ( rc == ZCONNECTIONLOSS ||
    +              rc == ZOPERATIONTIMEOUT )
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) WatchNodeMasterDelete deleted %s, with rc == ZOK\n"
    --- End diff --

    Yes, better to report actual error.
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381226#comment-16381226 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171420418

    --- Diff: core/sqf/src/trafconf/clusterconf.cpp ---
    @@ -373,6 +376,13 @@ bool CClusterConfig::LoadNodeConfig( void )
         for (int i =0; i < nodeCount; i++ )
         {
             ProcessLNode( nodeConfigData[i], pnodeConfigInfo, lnodeConfigInfo );
    +        // We want to pick the first configured node so all monitors pick the same one
    +        // This only comes into play for a Trafodion start from scratch
    +        if (i == 0)
    +        {
    +            configMaster_ = pnodeConfigInfo.pnid;
    +            strcpy (configMasterName_ ,pnodeConfigInfo.nodename);
    --- End diff --

    Doesn't hurt!
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381221#comment-16381221 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171419726

    --- Diff: core/sqf/monitor/linux/mlio.cxx ---
    @@ -1261,7 +1261,13 @@ SQ_LocalIOToClient::SQ_LocalIOToClient(int nid)
         if (cmid == -1)
         {
             if (trace_settings & TRACE_INIT)
    -            trace_printf("%s@%d" " failed shmget(" "%d" "), errno=" "%d" "\n", method_name, __LINE__, (shsize), errno);
    +        {
    +            int err = errno;
    +            char la_buf[MON_STRING_BUF_SIZE];
    --- End diff --

    yep
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381205#comment-16381205 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user narendragoyal commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171351676

    --- Diff: core/sqf/monitor/linux/mlio.cxx ---
    @@ -1261,7 +1261,13 @@ SQ_LocalIOToClient::SQ_LocalIOToClient(int nid)
         if (cmid == -1)
         {
             if (trace_settings & TRACE_INIT)
    -            trace_printf("%s@%d" " failed shmget(" "%d" "), errno=" "%d" "\n", method_name, __LINE__, (shsize), errno);
    +        {
    +            int err = errno;
    +            char la_buf[MON_STRING_BUF_SIZE];
    --- End diff --

    la_buf not being used in this block
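For context on the block under discussion: the diff saves `errno` into a local right after the failed `shmget()`, which keeps the value from being overwritten by later library calls. A minimal sketch of formatting the message from that saved value with a single buffer is shown below. `sketch_shmget_failure` is a hypothetical helper, not code from the pull request, and the 256-byte buffer is an arbitrary choice.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: format the failure text from an errno value that the
// caller saved immediately after the failing call.
std::string sketch_shmget_failure( long requestedSize, int savedErrno )
{
    char buf[256];
    snprintf( buf, sizeof(buf)
            , "failed shmget(%ld), errno=%d (%s)"
            , requestedSize, savedErrno, strerror( savedErrno ) );
    return std::string( buf );
}
```

A caller would do `int err = errno;` directly after `shmget()` returns -1 and pass `err` to the helper, so no second scratch buffer is left unused in the block.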
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381209#comment-16381209 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user narendragoyal commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171353312

    --- Diff: core/sqf/src/trafconf/clusterconf.cpp ---
    @@ -373,6 +376,13 @@ bool CClusterConfig::LoadNodeConfig( void )
         for (int i =0; i < nodeCount; i++ )
         {
             ProcessLNode( nodeConfigData[i], pnodeConfigInfo, lnodeConfigInfo );
    +        // We want to pick the first configured node so all monitors pick the same one
    +        // This only comes into play for a Trafodion start from scratch
    +        if (i == 0)
    +        {
    +            configMaster_ = pnodeConfigInfo.pnid;
    +            strcpy (configMasterName_ ,pnodeConfigInfo.nodename);
    --- End diff --

    if necessary, could limit the copy to the length of the destination
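Limiting the copy to the destination length, as suggested above, might be sketched like this. `sketch_bounded_copy` is a hypothetical helper, and the assumption is that the destination is a fixed-size `char` array whose size is known at the call site (the declaration of `configMasterName_` is not shown in the diff).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy src into a fixed-size destination, truncating
// if needed and always NUL-terminating. Returns true if src fit entirely.
bool sketch_bounded_copy( char *dst, std::size_t dstSize, const char *src )
{
    if ( dst == nullptr || dstSize == 0 )
    {
        return false;
    }
    if ( src == nullptr )
    {
        dst[0] = '\0';
        return false;
    }
    std::size_t len = std::strlen( src );
    std::size_t n = ( len < dstSize ) ? len : dstSize - 1;
    std::memcpy( dst, src, n );
    dst[n] = '\0';
    return len < dstSize;
}
```

Unlike a bare `strncpy`, this always terminates the destination, and the return value lets the caller log a truncated node name instead of silently accepting it.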
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381211#comment-16381211 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user narendragoyal commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171365789

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -1484,6 +1676,53 @@ void CZClient::WatchCluster( void )
         TRACE_EXIT;
     }

    +int CZClient::WatchMasterNode( const char *nodeName )
    +{
    +    const char method_name[] = "CZClient::WatchMasterNode";
    +    TRACE_ENTRY;
    +
    --- End diff --

    check for empty/null nodeName just in case :)?
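If such a guard were wanted, a minimal sketch could be the following. Both functions are hypothetical, and the -1 return value is a stand-in, since the thread does not say which error code the monitor would use.

```cpp
// Hypothetical helper: true only for a non-null, non-empty C string.
bool sketch_valid_node_name( const char *nodeName )
{
    return nodeName != nullptr && nodeName[0] != '\0';
}

// Hypothetical wrapper showing where the guard would sit.
int sketch_watch_master_node( const char *nodeName )
{
    if ( !sketch_valid_node_name( nodeName ) )
    {
        return -1; // stand-in error code: reject before building a znode path
    }
    return 0;      // placeholder for the real watch logic
}
```

The check costs one comparison per call. Whether it is worth having depends on whether callers can ever pass an unset name, which is the judgment call being made in this thread.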
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381206#comment-16381206 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user narendragoyal commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171387433

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -488,6 +488,103 @@ int CZClient::ZooExistRetry(zhandle_t *zh, const char *path, int watch, struct S
         return rc;
     }

    +const char* CZClient::WaitForAndReturnMaster( bool doWait )
    +{
    +    const char method_name[] = "CZClient::WaitForAndReturnMaster";
    +    TRACE_ENTRY;
    +
    +    bool found = false;
    +    int rc = -1;
    +    int retries = 0;
    +    Stat stat;
    +
    +    struct String_vector nodes = {0, NULL};
    +    stringstream ss;
    +    ss.str( "" );
    +    ss << zkRootNode_.c_str()
    +       << zkRootNodeInstance_.c_str()
    +       << ZCLIENT_MASTER_ZNODE;
    +    string masterMonitor( ss.str( ) );
    +
    +    // wait for 3 minutes for giving up.
    +    while ( (!found) && (retries < 180))
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d trafCluster=%s\n"
    +                        , method_name, __LINE__, masterMonitor.c_str() );
    +        }
    +        // Verify the existence of the parent ZCLIENT_MASTER_ZNODE
    +        rc = ZooExistRetry( ZHandle, masterMonitor.c_str( ), 0, );
    +
    +        if ( rc == ZNONODE )
    +        {
    +            if (doWait == false)
    +            {
    +                break;
    +            }
    +            continue;
    +        }
    +        else if ( rc == ZOK )
    +        {
    +            // Now get the list of available znodes in the cluster.
    +            //
    +            // This will return child znodes for each monitor process that has
    +            // registered, including this process.
    +            rc = zoo_get_children( ZHandle, masterMonitor.c_str( ), 0, );
    +            if ( nodes.count > 0 )
    +            {
    +                if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +                {
    +                    trace_printf( "%s@%d nodes.count=%d\n"
    +                                , method_name, __LINE__
    +                                , nodes.count );
    +                }
    +                found = true;
    +            }
    +            else
    +            {
    +                if (doWait == false)
    +                {
    +                    break;
    +                }
    +                usleep(100); // sleep for a second as to not overwhelm the system
    +                retries++;
    +                continue;
    +            }
    +        }
    +
    +        else // error
    +        {
    +            if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +            {
    +                trace_printf( "%s@%d Error (MasterMonitor) WaitForAndReturnMaster returned rc (%d), retries %d\n"
    --- End diff --

    I think we don't need the 'WaitForAndReturnMaster' in the string - the method_name being printed already has it
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381207#comment-16381207 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user narendragoyal commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171365500

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -1524,6 +1763,108 @@ int CZClient::WatchNode( const char *nodeName )
         return(rc);
     }

    +int CZClient::WatchNodeMasterDelete( const char *nodeName )
    +{
    +    const char method_name[] = "CZClient::WatchMasterDelete";
    +    TRACE_ENTRY;
    +
    --- End diff --

    Not sure if we need a check for empty/null nodeName here.
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381210#comment-16381210 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user narendragoyal commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171387644

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -488,6 +488,103 @@ int CZClient::ZooExistRetry(zhandle_t *zh, const char *path, int watch, struct S
         return rc;
     }

    +const char* CZClient::WaitForAndReturnMaster( bool doWait )
    +{
    +    const char method_name[] = "CZClient::WaitForAndReturnMaster";
    +    TRACE_ENTRY;
    +
    +    bool found = false;
    +    int rc = -1;
    +    int retries = 0;
    +    Stat stat;
    +
    +    struct String_vector nodes = {0, NULL};
    +    stringstream ss;
    +    ss.str( "" );
    +    ss << zkRootNode_.c_str()
    +       << zkRootNodeInstance_.c_str()
    +       << ZCLIENT_MASTER_ZNODE;
    +    string masterMonitor( ss.str( ) );
    +
    +    // wait for 3 minutes for giving up.
    +    while ( (!found) && (retries < 180))
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d trafCluster=%s\n"
    +                        , method_name, __LINE__, masterMonitor.c_str() );
    +        }
    +        // Verify the existence of the parent ZCLIENT_MASTER_ZNODE
    +        rc = ZooExistRetry( ZHandle, masterMonitor.c_str( ), 0, );
    +
    +        if ( rc == ZNONODE )
    +        {
    +            if (doWait == false)
    +            {
    +                break;
    +            }
    +            continue;
    +        }
    +        else if ( rc == ZOK )
    +        {
    +            // Now get the list of available znodes in the cluster.
    +            //
    +            // This will return child znodes for each monitor process that has
    +            // registered, including this process.
    +            rc = zoo_get_children( ZHandle, masterMonitor.c_str( ), 0, );
    +            if ( nodes.count > 0 )
    +            {
    +                if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +                {
    +                    trace_printf( "%s@%d nodes.count=%d\n"
    +                                , method_name, __LINE__
    +                                , nodes.count );
    +                }
    +                found = true;
    +            }
    +            else
    +            {
    +                if (doWait == false)
    +                {
    +                    break;
    +                }
    +                usleep(100); // sleep for a second as to not overwhelm the system
    +                retries++;
    +                continue;
    +            }
    +        }
    +
    +        else // error
    +        {
    +            if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +            {
    +                trace_printf( "%s@%d Error (MasterMonitor) WaitForAndReturnMaster returned rc (%d), retries %d\n"
    +                            , method_name, __LINE__, rc, retries );
    +            }
    +            char buf[MON_STRING_BUF_SIZE];
    +            snprintf( buf, sizeof(buf)
    +                    , "[%s], ZooExistRetry() for %s failed with error %s\n"
    +                    , method_name, masterMonitor.c_str( ), zerror(rc));
    +            mon_log_write(MON_ZCLIENT_WAITFORANDRETURNMASTER, SQ_LOG_ERR, buf);
    +            break;
    +        }
    +    }
    +
    +    //should we assert nodes.count == 1?
    +    if (found)
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +        {
    +            trace_printf( "%s@%d (MasterMonitor) Master Monitor found (%s)\n"
    +                        , method_name, __LINE__, masterMonitor.c_str() );
    +        }
    +        return nodes.data[0];
    --- End diff --

    TRACE_EXIT here?
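One way to avoid missing an exit trace on an early return, as asked above, is a scope guard that logs on destruction. The sketch below is an illustration only: `SketchTraceScope`, `sketch_trace_log` and `sketch_return_master` are hypothetical names, and nothing here reflects how the monitor's `TRACE_ENTRY` / `TRACE_EXIT` macros are actually defined.

```cpp
#include <string>
#include <vector>

// Stand-in trace sink for the sketch.
std::vector<std::string> sketch_trace_log;

// Hypothetical RAII guard: logs entry in the constructor and exit in the
// destructor, so every return path produces an exit record.
class SketchTraceScope
{
public:
    explicit SketchTraceScope( const char *methodName ) : name_( methodName )
    {
        sketch_trace_log.push_back( "enter " + name_ );
    }
    ~SketchTraceScope()
    {
        sketch_trace_log.push_back( "exit " + name_ );
    }
    SketchTraceScope( const SketchTraceScope & ) = delete;
    SketchTraceScope &operator=( const SketchTraceScope & ) = delete;

private:
    std::string name_;
};

// Hypothetical function with an early return, mirroring the shape of the
// "if (found) return ..." path in the diff.
const char *sketch_return_master( bool found )
{
    SketchTraceScope scope( "sketch_return_master" );
    if ( found )
    {
        return "node01"; // the guard still logs the exit here
    }
    return nullptr;
}
```

The simpler alternative, and the one the comment asks about, is to add an explicit `TRACE_EXIT;` before each `return`.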
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381136#comment-16381136 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171408186

    --- Diff: core/sqf/monitor/linux/cluster.cxx ---
    @@ -352,6 +358,131 @@ void CCluster::NodeReady( CNode *spareNode )
         TRACE_EXIT;
     }

    +// Assign leaders as required
    +// Current leaders are TM Leader and Monitor Leader
    +void CCluster::AssignLeaders( int pnid, bool checkProcess )
    +{
    +    const char method_name[] = "CCluster::AssignLeaders";
    +    TRACE_ENTRY;
    +
    +    AssignTmLeader ( pnid, checkProcess );
    +    AssignMonitorLeader ( pnid );
    +
    +    TRACE_EXIT;
    +}
    +
    +// Assign montior lead in the case of failure
    +void CCluster::AssignMonitorLeader( int pnid )
    +{
    +    const char method_name[] = "CCluster::AssignMonitorLeader";
    +    TRACE_ENTRY;
    +
    +    int i = 0;
    +    int rc = 0;
    +
    +    int lMonitorLeaderPNid = MonitorLeaderPNid;
    +    CNode *node = NULL;
    +
    +    if (MonitorLeaderPNid != pnid)
    +    {
    +        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY | TRACE_REQUEST | TRACE_SYNC | TRACE_TMSYNC))
    +        {
    +            trace_printf( "%s@%d" " - (MasterMonitor) returning, pnid %d != monitorLead %d\n"
    +                        , method_name, __LINE__, pnid, MonitorLeaderPNid );
    +        }
    +        return;
    --- End diff --

    Yes, good catch!
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381135#comment-16381135 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171408158

    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -799,6 +896,67 @@ bool CZClient::IsZNodeExpired( const char *nodeName, int )
         return( expired );
     }

    +int CZClient::CreateMasterZNode( const char *nodeName )
    +{
    +    const char method_name[] = "CZClient::CreateMasterZNode";
    +    TRACE_ENTRY;
    +
    +    int rc;
    +    int retries = 0;
    +
    +    stringstream masterpath;
    +    masterpath.str( "" );
    +    masterpath << zkRootNode_.c_str()
    +               << zkRootNodeInstance_.c_str()
    +               << ZCLIENT_MASTER_ZNODE<< "/"
    +               << nodeName;
    +
    +    string monZnode = masterpath.str( );
    +
    +    stringstream ss;
    +    ss.str( "" );
    +    ss <
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381134#comment-16381134 ]

ASF GitHub Bot commented on TRAFODION-2883:
--------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1457#discussion_r171408008

    --- Diff: core/sqf/monitor/linux/cluster.h ---
    @@ -229,6 +235,7 @@ class CCluster
         int configPNodesMax_; // max # of physical nodes that can be configured
         int *NodeMap;         // Mapping of Node ranks to COMM_WORLD ranks
         int TmLeaderNid;      // Nid of currently assigned TM Leader node
    +    int MonitorLeaderPNid; // PNid of currently assigned Monitor leader node
    --- End diff --

    Yes, we do. Noted these anomalies in the code this morning and will change this to the followed convention. Thanks for pointing this out.
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380732#comment-16380732 ] ASF GitHub Bot commented on TRAFODION-2883: --- Github user DaveBirdsall commented on a diff in the pull request: https://github.com/apache/trafodion/pull/1457#discussion_r171323042 --- Diff: core/sqf/monitor/linux/zclient.cxx --- @@ -488,6 +488,103 @@ int CZClient::ZooExistRetry(zhandle_t *zh, const char *path, int watch, struct S return rc; } +const char* CZClient::WaitForAndReturnMaster( bool doWait ) +{ +const char method_name[] = "CZClient::WaitForAndReturnMaster"; +TRACE_ENTRY; + +bool found = false; +int rc = -1; +int retries = 0; +Stat stat; + +struct String_vector nodes = {0, NULL}; +stringstream ss; +ss.str( "" ); +ss << zkRootNode_.c_str() + << zkRootNodeInstance_.c_str() + << ZCLIENT_MASTER_ZNODE; +string masterMonitor( ss.str( ) ); + +// wait for 3 minutes for giving up. +while ( (!found) && (retries < 180)) +{ +if (trace_settings & (TRACE_INIT | TRACE_RECOVERY)) +{ +trace_printf( "%s@%d trafCluster=%s\n" +, method_name, __LINE__, masterMonitor.c_str() ); +} +// Verify the existence of the parent ZCLIENT_MASTER_ZNODE +rc = ZooExistRetry( ZHandle, masterMonitor.c_str( ), 0, ); + +if ( rc == ZNONODE ) +{ +if (doWait == false) +{ +break; +} +continue; --- End diff -- Should we sleep in this path? Otherwise we seem to be in a spinning situation? 
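The concern raised in the review above (a `continue` that retries without pausing) can be illustrated with a small, self-contained sketch. This is not the code from the pull request: `WaitWithBackoff`, its parameters, and the return convention are hypothetical names chosen for illustration, and the probe callback merely stands in for the ZooKeeper existence check. The assumption is that every attempt counts toward the retry limit and is followed by a sleep, so that 180 attempts at one second each would correspond to the three minutes mentioned in the code comment.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of the Trafodion monitor.
// Calls `probe` until it reports success or `maxRetries` attempts have
// been made, sleeping `delay` after each failed attempt so the loop
// never spins on the CPU.
// Returns the number of failed attempts before success, or -1 on timeout.
int WaitWithBackoff( const std::function<bool()> &probe
                   , int maxRetries
                   , std::chrono::milliseconds delay )
{
    for ( int retries = 0; retries < maxRetries; ++retries )
    {
        if ( probe() )
        {
            return retries;
        }
        // Pause before the next attempt; this is the step the reviewer
        // asked about.
        std::this_thread::sleep_for( delay );
    }
    return -1;
}
```

A caller could pass a lambda that wraps the real existence check together with `maxRetries = 180` and a one-second delay; those values are taken from the quoted loop bound and comment, not from any confirmed fix.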
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380731#comment-16380731 ]
ASF GitHub Bot commented on TRAFODION-2883:
---
Github user DaveBirdsall commented on a diff in the pull request:
https://github.com/apache/trafodion/pull/1457#discussion_r171324209
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -799,6 +896,67 @@ bool CZClient::IsZNodeExpired( const char *nodeName, int )
     return( expired );
 }
 
+int CZClient::CreateMasterZNode( const char *nodeName )
+{
+    const char method_name[] = "CZClient::CreateMasterZNode";
+    TRACE_ENTRY;
+
+    int rc;
+    int retries = 0;
+
+    stringstream masterpath;
+    masterpath.str( "" );
+    masterpath << zkRootNode_.c_str()
+               << zkRootNodeInstance_.c_str()
+               << ZCLIENT_MASTER_ZNODE<< "/"
+               << nodeName;
+
+    string monZnode = masterpath.str( );
+
+    stringstream ss;
+    ss.str( "" );
+    ss <
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380733#comment-16380733 ]
ASF GitHub Bot commented on TRAFODION-2883:
---
Github user DaveBirdsall commented on a diff in the pull request:
https://github.com/apache/trafodion/pull/1457#discussion_r171325354
--- Diff: core/sqf/monitor/linux/cluster.cxx ---
@@ -352,6 +358,131 @@ void CCluster::NodeReady( CNode *spareNode )
     TRACE_EXIT;
 }
 
+// Assign leaders as required
+// Current leaders are TM Leader and Monitor Leader
+void CCluster::AssignLeaders( int pnid, bool checkProcess )
+{
+    const char method_name[] = "CCluster::AssignLeaders";
+    TRACE_ENTRY;
+
+    AssignTmLeader ( pnid, checkProcess );
+    AssignMonitorLeader ( pnid );
+
+    TRACE_EXIT;
+}
+
+// Assign montior lead in the case of failure
+void CCluster::AssignMonitorLeader( int pnid )
+{
+    const char method_name[] = "CCluster::AssignMonitorLeader";
+    TRACE_ENTRY;
+
+    int i = 0;
+    int rc = 0;
+
+    int lMonitorLeaderPNid = MonitorLeaderPNid;
+    CNode *node = NULL;
+
+    if (MonitorLeaderPNid != pnid)
+    {
+        if (trace_settings & (TRACE_INIT | TRACE_RECOVERY | TRACE_REQUEST | TRACE_SYNC | TRACE_TMSYNC))
+        {
+            trace_printf( "%s@%d" " - (MasterMonitor) returning, pnid %d != monitorLead %d\n"
+                        , method_name, __LINE__, pnid, MonitorLeaderPNid );
+        }
+        return;
--- End diff --
Should there be a TRACE_EXIT before this return?
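The question above, whether an early `return` skips the exit trace, is a general hazard of paired entry/exit macros. One common way to make the pairing automatic is a scope guard whose destructor emits the exit record on every return path. The sketch below only illustrates that idea: `ScopedTrace`, `g_traceLog`, and `AssignIfLeader` are hypothetical names, the in-memory log stands in for the monitor's real trace facility, and nothing here is claimed to be how `TRACE_ENTRY` and `TRACE_EXIT` are implemented.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the monitor's trace output; records events
// in memory so the behaviour can be inspected.
static std::vector<std::string> g_traceLog;

// Scope guard: logs entry on construction and exit on destruction, so
// every return path emits the exit record.
class ScopedTrace
{
public:
    explicit ScopedTrace( const char *method ) : method_( method )
    {
        g_traceLog.push_back( "ENTRY " + method_ );
    }
    ~ScopedTrace()
    {
        g_traceLog.push_back( "EXIT " + method_ );
    }
    ScopedTrace( const ScopedTrace & ) = delete;
    ScopedTrace &operator=( const ScopedTrace & ) = delete;

private:
    std::string method_;
};

// Illustrative function with an early return, mirroring only the shape
// of the reviewed code; the name and logic are assumptions.
bool AssignIfLeader( int pnid, int leaderPNid )
{
    ScopedTrace trace( "AssignIfLeader" );
    if ( leaderPNid != pnid )
    {
        return false;   // exit is still traced by ~ScopedTrace
    }
    return true;
}
```

With macro-based tracing, the alternative is to add the exit macro by hand before each early return, which is what the reviewer is asking about.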
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380734#comment-16380734 ]
ASF GitHub Bot commented on TRAFODION-2883:
---
Github user DaveBirdsall commented on a diff in the pull request:
https://github.com/apache/trafodion/pull/1457#discussion_r171320342
--- Diff: core/sqf/monitor/linux/cluster.h ---
@@ -229,6 +235,7 @@ class CCluster
     int configPNodesMax_; // max # of physical nodes that can be configured
     int *NodeMap; // Mapping of Node ranks to COMM_WORLD ranks
     int TmLeaderNid; // Nid of currently assigned TM Leader node
+    int MonitorLeaderPNid; // PNid of currently assigned Monitor leader node
--- End diff --
In much of the SQL code, the convention is that member names end in an underscore. Do we follow that convention in the monitor?
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379654#comment-16379654 ]
Gonzalo E Correa commented on TRAFODION-2883:
-
I have created the pull request, which will need a code review before it can be merged: [https://github.com/apache/trafodion/pull/1457]
These changes include the ability to run the monitor processes in AGENT mode from a Python installation, plus several other scale-related changes and bug fixes.
To enable AGENT mode, uncomment the following environment variables in sqenvcom.sh and copy the file to all nodes.
{panel:title=sqenvcom.sh}
# Monitor process creator:
#   MPIRUN - monitor process is created by mpirun
# Uncomment SQ_MON_CREATOR when running monitor in AGENT mode
#export SQ_MON_CREATOR=MPIRUN
# Monitor process run mode:
#   AGENT - monitor process runs in agent mode versus MPI collective
# Uncomment the three environment variables below
#export SQ_MON_RUN_MODE=AGENT
#export MONITOR_COMM_PORT=23399
#export MONITOR_SYNC_PORT=23398
{panel}
An alternative to the above is to add the following to sql/scripts/shell.env:
SQ_MON_CREATOR=MPIRUN
SQ_MON_RUN_MODE=AGENT
MONITOR_COMM_PORT=23399
MONITOR_SYNC_PORT=23398
To enable monitor trace when in AGENT mode, use the file sql/scripts/monitor.env and uncomment the desired trace level.
Once this is merged to the baseline, I will merge these changes up to the shared TRAFODION-2884 branch in the zcorrea_fork.
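To make the effect of those settings concrete, here is a minimal sketch of how a process could read the run mode and the two ports from its environment. Only the variable names (`SQ_MON_RUN_MODE`, `MONITOR_COMM_PORT`, `MONITOR_SYNC_PORT`) and the value `AGENT` come from the comment above. The `MonitorEnv` struct, the `EnvInt` and `ReadMonitorEnv` helpers, and the `-1` fallback are assumptions made for illustration and do not describe how the Trafodion monitor actually parses its configuration.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical settings holder; field names are illustrative only.
struct MonitorEnv
{
    bool agentMode;   // true when SQ_MON_RUN_MODE is exactly "AGENT"
    int  commPort;    // MONITOR_COMM_PORT, or -1 when unset/invalid
    int  syncPort;    // MONITOR_SYNC_PORT, or -1 when unset/invalid
};

// Returns the integer value of environment variable `name`, or
// `fallback` when it is unset, empty, or not a plain decimal number.
static int EnvInt( const char *name, int fallback )
{
    const char *value = std::getenv( name );
    if ( value == nullptr || *value == '\0' )
    {
        return fallback;
    }
    char *end = nullptr;
    long parsed = std::strtol( value, &end, 10 );
    return ( *end == '\0' ) ? static_cast<int>( parsed ) : fallback;
}

// Collects the AGENT-mode settings from the process environment.
MonitorEnv ReadMonitorEnv()
{
    const char *mode = std::getenv( "SQ_MON_RUN_MODE" );
    MonitorEnv env;
    env.agentMode = ( mode != nullptr && std::string( mode ) == "AGENT" );
    env.commPort  = EnvInt( "MONITOR_COMM_PORT", -1 );
    env.syncPort  = EnvInt( "MONITOR_SYNC_PORT", -1 );
    return env;
}
```

Under these assumptions, a process started with the three variables exported would see `agentMode == true` and the two configured port numbers; one started without them would see `false` and `-1` for each port.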
[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements
[ https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379632#comment-16379632 ]
ASF GitHub Bot commented on TRAFODION-2883:
---
GitHub user zcorrea opened a pull request:
https://github.com/apache/trafodion/pull/1457
[TRAFODION-2883] Preliminary Scale Enhancements
- Added timestamps to node down system message
- Added timestamps and values to registry change notifications
- Fixed monitor trace causing memory overwrites
- Implemented AGENT mode monitor functionality
  o This is a preliminary change to remove dependency on OpenMPI during initialization of operational cluster by creating a cluster of one node (MASTER monitor) where other remote nodes (SLAVE monitors) join the cluster through the MASTER
- Implemented MASTER monitor selection logic
- Scale bug fixes found when creating clusters greater than 120 nodes
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/zcorrea/trafodion TRAFODION-2883
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/trafodion/pull/1457.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #1457

commit 887051c8a1ba57b0ad47278dce2fc9361a2afb9c
Author: Zalo Correa
Date: 2018-02-21T00:57:40Z
    [TRAFODION-2883] Preliminary Scale Enhacements
    - Increased cluster node limit to 1024
    - Added timestamps to node down system message
    - Added timestamps and values to registry change notifications
    - Fixed monitor trace causing memory overwrites

commit bded0e843f8b600a5459c5353bbdc9f59d6d6551
Author: Zalo Correa
Date: 2018-02-28T01:34:25Z
    [TRAFODION-2883] Preliminary Scale Enhacements
    - Added timestamps to node down system message
    - Added timestamps and values to registry change notifications
    - Fixed monitor trace causing memory overwrites
    - Implemented AGENT mode monitor functionality
      o This is a preliminary change to remove dependency on OpenMPI during initialization of operational cluster by creating a cluster of one node (MASTER monitor) where other remote nodes (SLAVE monitors) join the cluster through the MASTER
    - Implemented MASTER monitor selection logic
    - Scale bug fixes found when creating clusters greater than 120 nodes