[jira] [Resolved] (TRAFODION-3334) Communication IO between monitor processes must use timeouts and retries

2020-03-04 Thread Gonzalo E Correa (Jira)


 [ 
https://issues.apache.org/jira/browse/TRAFODION-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa resolved TRAFODION-3334.
-
Resolution: Fixed

> Communication IO between monitor processes must use timeouts and retries
> 
>
> Key: TRAFODION-3334
> URL: https://issues.apache.org/jira/browse/TRAFODION-3334
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.4
>Reporter: Gonzalo E Correa
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Most communication channels used by monitor processes to exchange cluster 
> state information and to handle failure detection must be changed to 
> asynchronous IO with timeouts and retries to allow for the removal of a 
> monitor process from the cluster communication. This is to prevent a 'Sync 
> Thread Timeout' failure of the entire cluster instance when a monitor 
> process or its host server becomes unresponsive due to a server or network 
> failure.
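
A minimal sketch of the timeout-and-retry idea, written in Python for brevity. It is not Trafodion code: the function name, the retry count, and the timeout values are illustrative assumptions, and it uses blocking sockets with a timeout rather than true asynchronous IO. Each attempt to reach a peer is bounded in time, and after a fixed number of failed attempts the peer is reported as unreachable, so the caller can drop it from cluster communication instead of blocking indefinitely.

```python
import socket
import time


def exchange_with_retries(host, port, payload, timeout_s=1.0, retries=3, backoff_s=0.0):
    """Send payload to a peer and return its reply, or None if the peer
    stays unresponsive after all attempts (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        try:
            # The timeout applies to the connect and to the later send/recv calls
            with socket.create_connection((host, port), timeout=timeout_s) as sock:
                sock.sendall(payload)
                return sock.recv(4096)
        except OSError:
            # Covers refused connections, resets, and timeouts
            if attempt < retries:
                time.sleep(backoff_s)
    return None
```

A caller that receives `None` would treat the peer as failed and remove it from the set of monitors it communicates with.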



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TRAFODION-3334) Communication IO between monitor processes must use timeouts and retries

2020-01-31 Thread Gonzalo E Correa (Jira)
Gonzalo E Correa created TRAFODION-3334:
---

 Summary: Communication IO between monitor processes must use 
timeouts and retries
 Key: TRAFODION-3334
 URL: https://issues.apache.org/jira/browse/TRAFODION-3334
 Project: Apache Trafodion
  Issue Type: Bug
  Components: foundation
Affects Versions: 2.4
Reporter: Gonzalo E Correa
 Fix For: 2.4


Most communication channels used by monitor processes to exchange cluster state 
information and to handle failure detection must be changed to asynchronous IO 
with timeouts and retries to allow for the removal of a monitor process from 
the cluster communication. This is to prevent a 'Sync Thread Timeout' failure 
of the entire cluster instance when a monitor process or its host server 
becomes unresponsive due to a server or network failure.





[jira] [Resolved] (TRAFODION-3318) Change process management of DTM to improve HA behavior

2020-01-31 Thread Gonzalo E Correa (Jira)


 [ 
https://issues.apache.org/jira/browse/TRAFODION-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa resolved TRAFODION-3318.
-
Resolution: Fixed

> Change process management of DTM to improve HA behavior
> ---
>
> Key: TRAFODION-3318
> URL: https://issues.apache.org/jira/browse/TRAFODION-3318
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation
>Affects Versions: 2.4
>Reporter: Gonzalo E Correa
>Priority: Major
> Fix For: 2.4
>
>   Original Estimate: 120h
>  Time Spent: 3h
>  Remaining Estimate: 117h
>
> The current process management model for process type DTM enforces a soft node 
> down behavior, which kills all processes in a node where a DTM process 
> terminates abnormally. The monitor then recreates the DTM process along with 
> all persistent processes hosted in that node.
> To reduce the fault zone impact, this change removes the soft node down/up 
> functionality so that the DTM process is recreated without killing all other 
> processes in the node. One rule is still enforced: if the persistent DTM 
> process cannot be restarted within the configured retries in the specified 
> time window, the node is brought down.
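
The rule that stays in force can be pictured as a sliding-window counter. The sketch below is illustrative Python, not the monitor's code; the class name and the `max_retries` and `window_s` parameters are assumptions standing in for "the configured retries in the specified time window".

```python
from collections import deque


class RestartPolicy:
    """Tracks restarts of a persistent process within a sliding time window."""

    def __init__(self, max_retries, window_s):
        self.max_retries = max_retries
        self.window_s = window_s
        self._restarts = deque()  # timestamps of recent restarts

    def on_abnormal_exit(self, now):
        """Return 'restart' if the process may be recreated, else 'node_down'."""
        # Forget restarts that have fallen out of the window
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_retries:
            return "node_down"
        self._restarts.append(now)
        return "restart"
```

With `RestartPolicy(2, 10.0)`, two abnormal exits within ten seconds are answered with a restart; a third inside the same window yields `node_down`.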





[jira] [Updated] (TRAFODION-3318) Change process management of DTM to improve HA behavior

2019-07-24 Thread Gonzalo E Correa (JIRA)


 [ 
https://issues.apache.org/jira/browse/TRAFODION-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-3318:

Affects Version/s: 2.4
Fix Version/s: 2.4

> Change process management of DTM to improve HA behavior
> ---
>
> Key: TRAFODION-3318
> URL: https://issues.apache.org/jira/browse/TRAFODION-3318
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation
>Affects Versions: 2.4
>Reporter: Gonzalo E Correa
>Priority: Major
> Fix For: 2.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The current process management model for process type DTM enforces a soft node 
> down behavior, which kills all processes in a node where a DTM process 
> terminates abnormally. The monitor then recreates the DTM process along with 
> all persistent processes hosted in that node.
> To reduce the fault zone impact, this change removes the soft node down/up 
> functionality so that the DTM process is recreated without killing all other 
> processes in the node. One rule is still enforced: if the persistent DTM 
> process cannot be restarted within the configured retries in the specified 
> time window, the node is brought down.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (TRAFODION-3318) Change process management of DTM to improve HA behavior

2019-07-24 Thread Gonzalo E Correa (JIRA)


 [ 
https://issues.apache.org/jira/browse/TRAFODION-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-3318:

Summary: Change process management of DTM to improve HA behavior  (was: 
Change process management of DTM improve HA behavior)

> Change process management of DTM to improve HA behavior
> ---
>
> Key: TRAFODION-3318
> URL: https://issues.apache.org/jira/browse/TRAFODION-3318
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation
>Reporter: Gonzalo E Correa
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The current process management model for process type DTM enforces a soft node 
> down behavior, which kills all processes in a node where a DTM process 
> terminates abnormally. The monitor then recreates the DTM process along with 
> all persistent processes hosted in that node.
> To reduce the fault zone impact, this change removes the soft node down/up 
> functionality so that the DTM process is recreated without killing all other 
> processes in the node. One rule is still enforced: if the persistent DTM 
> process cannot be restarted within the configured retries in the specified 
> time window, the node is brought down.





[jira] [Created] (TRAFODION-3318) Change process management of DTM improve HA behavior

2019-07-24 Thread Gonzalo E Correa (JIRA)
Gonzalo E Correa created TRAFODION-3318:
---

 Summary: Change process management of DTM improve HA behavior
 Key: TRAFODION-3318
 URL: https://issues.apache.org/jira/browse/TRAFODION-3318
 Project: Apache Trafodion
  Issue Type: Improvement
  Components: dtm, foundation
Reporter: Gonzalo E Correa


The current process management model for process type DTM enforces a soft node 
down behavior, which kills all processes in a node where a DTM process 
terminates abnormally. The monitor then recreates the DTM process along with 
all persistent processes hosted in that node.

To reduce the fault zone impact, this change removes the soft node down/up 
functionality so that the DTM process is recreated without killing all other 
processes in the node. One rule is still enforced: if the persistent DTM 
process cannot be restarted within the configured retries in the specified 
time window, the node is brought down.





[jira] [Resolved] (TRAFODION-3275) Log files are no longer generated in monitor created processes

2019-02-14 Thread Gonzalo E Correa (JIRA)


 [ 
https://issues.apache.org/jira/browse/TRAFODION-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa resolved TRAFODION-3275.
-
Resolution: Fixed

Added propagation of the TRAF_LOG environment variable to child processes in 
monitor CProcess::Create().
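
The kind of fix described in the resolution note can be illustrated in Python. This is a sketch under stated assumptions, not the code in CProcess::Create(): the helper names and the `/tmp/traf_logs` value used in the usage note are hypothetical. It shows a child environment being built explicitly, with the variable copied from the parent so that the child can see it.

```python
import subprocess
import sys


def build_child_env(parent_env, base_env, propagate=("TRAF_LOG",)):
    """Return base_env plus the named variables copied from parent_env, if set."""
    env = dict(base_env)
    for name in propagate:
        if name in parent_env:
            env[name] = parent_env[name]
    return env


def spawn_and_read(var, env):
    """Start a child Python process with env and return the value it sees for var."""
    code = "import os; print(os.environ.get(%r, ''))" % var
    result = subprocess.run([sys.executable, "-c", code], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

If the parent has `TRAF_LOG=/tmp/traf_logs` set and the base environment lacks it, a child started with `build_child_env(parent, base)` sees the value, while a child started with `base` alone sees nothing.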

> Log files are no longer generated in monitor created processes
> --
>
> Key: TRAFODION-3275
> URL: https://issues.apache.org/jira/browse/TRAFODION-3275
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.4
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While debugging a node down situation, I noticed that the WDG log files no 
> longer get generated in R2.4. It's possible that the TRAF_LOG environment 
> variable is not getting propagated properly to child processes by the monitor 
> process as recent changes have been merged in this area.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-3275) Log files are no longer generated in monitor created processes

2019-02-12 Thread Gonzalo E Correa (JIRA)


[ 
https://issues.apache.org/jira/browse/TRAFODION-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766561#comment-16766561
 ] 

Gonzalo E Correa commented on TRAFODION-3275:
-

Added propagation of the TRAF_LOG environment variable to child processes in 
monitor CProcess::Create().

> Log files are no longer generated in monitor created processes
> --
>
> Key: TRAFODION-3275
> URL: https://issues.apache.org/jira/browse/TRAFODION-3275
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.4
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.4
>
>
> While debugging a node down situation, I noticed that the WDG log files no 
> longer get generated in R2.4. It's possible that the TRAF_LOG environment 
> variable is not getting propagated properly to child processes by the monitor 
> process as recent changes have been merged in this area.





[jira] [Work started] (TRAFODION-3275) Log files are no longer generated in monitor created processes

2019-02-12 Thread Gonzalo E Correa (JIRA)


 [ 
https://issues.apache.org/jira/browse/TRAFODION-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on TRAFODION-3275 started by Gonzalo E Correa.
---
> Log files are no longer generated in monitor created processes
> --
>
> Key: TRAFODION-3275
> URL: https://issues.apache.org/jira/browse/TRAFODION-3275
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.4
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.4
>
>
> While debugging a node down situation, I noticed that the WDG log files no 
> longer get generated in R2.4. It's possible that the TRAF_LOG environment 
> variable is not getting propagated properly to child processes by the monitor 
> process as recent changes have been merged in this area.





[jira] [Created] (TRAFODION-3275) Log files are no longer generated in monitor created processes

2019-02-12 Thread Gonzalo E Correa (JIRA)
Gonzalo E Correa created TRAFODION-3275:
---

 Summary: Log files are no longer generated in monitor created 
processes
 Key: TRAFODION-3275
 URL: https://issues.apache.org/jira/browse/TRAFODION-3275
 Project: Apache Trafodion
  Issue Type: Bug
  Components: foundation
Affects Versions: 2.4
Reporter: Gonzalo E Correa
Assignee: Gonzalo E Correa
 Fix For: 2.4


While debugging a node down situation, I noticed that the WDG log files no 
longer get generated in R2.4. It's possible that the TRAF_LOG environment 
variable is not getting propagated properly to child processes by the monitor 
process as recent changes have been merged in this area.





[jira] [Commented] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-09 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393653#comment-16393653
 ] 

Gonzalo E Correa commented on TRAFODION-2884:
-

We should consider doing what you suggest. There is code in the monitor today 
to use the cores specified in the node section, but by default all cores are 
used. It can be enabled by the environment variable SQ_USE_CPU_AFFINITY, which 
is off by default. So your suggestion makes sense. Perhaps a question to the 
community in general would help to get some feedback on this. I can write up 
the original intended use and what we would give up, and see if there are any 
unforeseen repercussions to making this change.

Good suggestion, by the way!
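
For readers following along, the current behavior described in this comment can be sketched in Python. This is not the monitor's code: the function is hypothetical, and treating the value "1" as "enabled" is an assumption, since the comment only says the variable is off by default.

```python
def cores_for_node(env, configured_cores, all_cores):
    """Return the set of cores a node's processes may use."""
    # Assumption: "1" turns the feature on; unset or any other value leaves it off
    if env.get("SQ_USE_CPU_AFFINITY", "0") == "1":
        return set(configured_cores)
    # Default: the cores listed in the node section are ignored and all cores are used
    return set(all_cores)
```

On Linux, the returned set could be applied to the current process with `os.sched_setaffinity(0, cores)`.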

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: 
> TRAFODION-2884-Scalable_Name_Space-Architecure-review.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes-review.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Overview-20180308.pptx
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Comment Edited] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-09 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392057#comment-16392057
 ] 

Gonzalo E Correa edited comment on TRAFODION-2884 at 3/9/18 9:38 PM:
-

There is no flow chart, but starting a Trafodion instance is as follows:
 # MPIRUN or supervisord creates the monitor process across the configured 
nodes.
 # The monitor process initialization logic determines whether it is running in 
AGENT mode or as an MPI collective.
 ** These are the two code paths by which the monitor processes determine who 
the cluster members are, that is, the operational view of the cluster composed 
of monitor processes.
 ** We are working on making the AGENT mode the only way the monitor processes 
initialize so that we can remove the MPI collective initialization logic.
 # The monitor process initialization logic determines if the Name Server is 
enabled or not.

I have added an overview document with some pictures of monitor process 
creation and initialization activities, which you can refer to as I try to 
answer your other questions.

Question 1:

The monitor processes are created by mpirun in a Python installation and, in 
the future, by supervisord in a Cloudera Manager installation. Who creates 
the process is not that important. How the monitor processes join together to 
create a Trafodion cluster is the change which is addressed by the 
MASTER/non-MASTER functionality. I have added logic in the monitor, enabled by 
an environment variable, that causes the monitor to run in AGENT mode, which 
uses the MASTER/non-MASTER method of creating a cluster: non-MASTER monitor 
processes join the cluster through the MASTER monitor process. The environment 
variables will be documented as follows:

 
{quote}NOTE: This is a work in progress, so it may change!

# Monitor process creator:
#
# MPIRUN - monitor process is created by mpirun
#
# Uncomment SQ_MON_CREATOR when running monitor in AGENT mode
#export SQ_MON_CREATOR=MPIRUN

# Monitor process run mode:
#
# AGENT - monitor process runs in agent mode versus MPI collective
#
# Uncomment the next three environment variables
#export SQ_MON_RUN_MODE=AGENT
#export MONITOR_COMM_PORT=23399
#export MONITOR_SYNC_PORT=23398
#
# NAME-SERVER - to disable process replication and enable name-server
#
# Uncomment the next four environment variables
#export NAMESERVER_ENABLE=1
#export NS_COMM_PORT=23397
#export NS_SYNC_PORT=23396
#export NS_M2N_COMM_PORT=23395

So a Python installation can set the MPIRUN and AGENT environment variables 
which will tell the monitor MPIRUN is the creator and it is to run in AGENT 
mode.

In addition, the Name Server logic can be enabled or disabled. This is separate 
from the monitor executing in AGENT mode.
{quote}
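
How a process might interpret these variables can be sketched in Python. The variable names and port numbers come from the note above; the function, the dictionary keys, and the choice to treat an unset variable as "not AGENT" or "Name Server disabled" are assumptions made for illustration only.

```python
def read_monitor_config(env):
    """Interpret the monitor-related environment variables from a mapping."""
    config = {
        "creator": env.get("SQ_MON_CREATOR"),  # "MPIRUN", or None if unset
        "agent_mode": env.get("SQ_MON_RUN_MODE") == "AGENT",
        "nameserver": env.get("NAMESERVER_ENABLE") == "1",
        "ports": {},
    }
    port_vars = []
    if config["agent_mode"]:
        port_vars += ["MONITOR_COMM_PORT", "MONITOR_SYNC_PORT"]
    if config["nameserver"]:
        port_vars += ["NS_COMM_PORT", "NS_SYNC_PORT", "NS_M2N_COMM_PORT"]
    # Only collect the ports that are relevant to the enabled features and present
    for name in port_vars:
        if name in env:
            config["ports"][name] = int(env[name])
    return config
```

Passing `os.environ` as `env` reads the live settings. AGENT mode and the Name Server are evaluated independently, matching the statement that the two are separate.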
Question 2:

The Name Server processes will always behave as if they were in AGENT mode, 
meaning that the MASTER/non-MASTER method is how they initialize the set. The 
non-MASTER Name Server processes join the set through the MASTER Name Server 
process.

Question 3:

They don't. At initialization time, a monitor process will create a Name 
Server process as the first 'primitive' process if there is a Name Server 
configured to run in its node, and connect to it. Otherwise, the monitor process 
will select an existing Name Server process to connect to. All monitor 
processes must connect to a Name Server process when NAMESERVER_ENABLE=1; 
otherwise, the monitor will terminate with an error.

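
The decision described in the answer to Question 3 can be sketched in Python. This is an illustration, not the monitor's logic: the function name and return values are hypothetical, and picking the lowest node id among running Name Servers is an assumption, since the comment does not say how an existing Name Server is selected.

```python
def choose_name_server(my_node, configured_nodes, running_nodes, nameserver_enabled=True):
    """Decide how a monitor on my_node obtains its Name Server connection.

    Returns ("create", node), ("connect", node), or None when the Name Server
    is disabled.
    """
    if not nameserver_enabled:
        return None
    if my_node in configured_nodes:
        # A Name Server is configured on this node: create it locally and connect to it
        return ("create", my_node)
    if running_nodes:
        # Selection policy is an assumption: take the lowest node id for determinism
        return ("connect", min(running_nodes))
    # With the Name Server enabled, a monitor that cannot connect terminates
    raise RuntimeError("no Name Server available; monitor terminates with an error")
```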


[jira] [Comment Edited] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-09 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392057#comment-16392057
 ] 

Gonzalo E Correa edited comment on TRAFODION-2884 at 3/9/18 9:37 PM:
-

There is no flow chart, but starting a Trafodion instance is as follows:
 # MPIRUN or supervisord creates the monitor process across the configured 
nodes.
 # The monitor process initialization logic determines whether it is running in 
AGENT mode or as an MPI collective.
 ** These are the two code paths by which the monitor processes determine who 
the cluster members are, that is, the operational view of the cluster composed 
of monitor processes.
 ** We are working on making the AGENT mode the only way the monitor processes 
initialize so that we can remove the MPI collective initialization logic.
 # The monitor process initialization logic determines if the Name Server is 
enabled or not.

I have added an overview document with some pictures of monitor process 
creation and initialization activities, which you can refer to as I try to 
answer your other questions.
 # The monitor processes are created by mpirun in a Python installation and, in 
the future, by supervisord in a Cloudera Manager installation. Who creates 
the process is not that important. How the monitor processes join together to 
create a Trafodion cluster is the change which is addressed by the 
MASTER/non-MASTER functionality. I have added logic in the monitor, enabled by 
an environment variable, that causes the monitor to run in AGENT mode, which 
uses the MASTER/non-MASTER method of creating a cluster: non-MASTER monitor 
processes join the cluster through the MASTER monitor process. The environment 
variables will be documented as follows:

{quote}NOTE: This is a work in progress, so it may change!

# Monitor process creator:
#
# MPIRUN - monitor process is created by mpirun
#
# Uncomment SQ_MON_CREATOR when running monitor in AGENT mode
#export SQ_MON_CREATOR=MPIRUN

# Monitor process run mode:
#
# AGENT - monitor process runs in agent mode versus MPI collective
#
# Uncomment the next three environment variables
#export SQ_MON_RUN_MODE=AGENT
#export MONITOR_COMM_PORT=23399
#export MONITOR_SYNC_PORT=23398
#
# NAME-SERVER - to disable process replication and enable name-server
#
# Uncomment the next four environment variables
#export NAMESERVER_ENABLE=1
#export NS_COMM_PORT=23397
#export NS_SYNC_PORT=23396
#export NS_M2N_COMM_PORT=23395

So a Python installation can set the MPIRUN and AGENT environment variables 
which will tell the monitor MPIRUN is the creator and it is to run in AGENT 
mode.

In addition, the Name Server logic can be enabled or disabled. This is separate 
from the monitor executing in AGENT mode.
{quote}
 # The Name Server processes will always behave as if they were in AGENT mode, 
meaning that the MASTER/non-MASTER method is how they initialize the set. The 
non-MASTER Name Server processes join the set through the MASTER Name Server 
process.
 # They don't. At initialization time, a monitor process will create a Name 
Server process as the first 'primitive' process if there is a Name Server 
configured to run in its node, and connect to it. Otherwise, the monitor process 
will select an existing Name Server process to connect to. All monitor 
processes must connect to a Name Server process when NAMESERVER_ENABLE=1; 
otherwise, the monitor will terminate with an error.



[jira] [Commented] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-08 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392057#comment-16392057
 ] 

Gonzalo E Correa commented on TRAFODION-2884:
-

There is no flow chart, but starting a Trafodion instance is as follows:
 # MPIRUN or supervisord creates the monitor process across the configured 
nodes.
 # The monitor process initialization logic determines whether it is running in 
AGENT mode or as an MPI collective.
 ## These are the two code paths by which the monitor processes determine who 
the cluster members are, that is, the operational view of the cluster composed 
of monitor processes.
 ## We are working on making the AGENT mode the only way the monitor processes 
initialize so that we can remove the MPI collective initialization logic.
 # The monitor process initialization logic determines if the Name Server is 
enabled or not.

I have added an overview document with some pictures of monitor process 
creation and initialization activities, which you can refer to as I try to 
answer your other questions.
 # The monitor processes are created by mpirun in a Python installation and, in 
the future, by supervisord in a Cloudera Manager installation. Who creates 
the process is not that important. How the monitor processes join together to 
create a Trafodion cluster is the change which is addressed by the 
MASTER/non-MASTER functionality. I have added logic in the monitor, enabled by 
an environment variable, that causes the monitor to run in AGENT mode, which 
uses the MASTER/non-MASTER method of creating a cluster: non-MASTER monitor 
processes join the cluster through the MASTER monitor process. The environment 
variables will be documented as follows:

NOTE: This is a work in progress, so it may change!

# Monitor process creator:
#
# MPIRUN - monitor process is created by mpirun
#
# Uncomment SQ_MON_CREATOR when running monitor in AGENT mode
#export SQ_MON_CREATOR=MPIRUN

# Monitor process run mode:
#
# AGENT - monitor process runs in agent mode versus MPI collective
#
# Uncomment the next three environment variables
#export SQ_MON_RUN_MODE=AGENT
#export MONITOR_COMM_PORT=23399
#export MONITOR_SYNC_PORT=23398
#
# NAME-SERVER - to disable process replication and enable name-server
#
# Uncomment the next four environment variables
#export NAMESERVER_ENABLE=1
#export NS_COMM_PORT=23397
#export NS_SYNC_PORT=23396
#export NS_M2N_COMM_PORT=23395

So a Python installation can set the MPIRUN and AGENT environment variables 
which will tell the monitor MPIRUN is the creator and it is to run in AGENT 
mode.

In addition, the Name Server logic can be enabled or disabled. This is separate 
from the monitor executing in AGENT mode.


 # The Name Server processes will always behave as if they were in AGENT mode, 
meaning that the MASTER/non-MASTER method is how they initialize the set. The 
non-MASTER Name Server processes join the set through the MASTER Name Server 
process.
 # They don't. At initialization time, a monitor process will create a Name 
Server process as the first 'primitive' process if there is a Name Server 
configured to run in its node, and connect to it. Otherwise, the monitor process 
will select an existing Name Server process to connect to. All monitor 
processes must connect to a Name Server process when NAMESERVER_ENABLE=1; 
otherwise, the monitor will terminate with an error.

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: 
> TRAFODION-2884-Scalable_Name_Space-Architecure-review.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes-review.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Overview-20180308.pptx
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Updated] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-08 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-2884:

Attachment: TRAFODION-2884-Scalable_Name_Space-Overview-20180308.pptx

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: 
> TRAFODION-2884-Scalable_Name_Space-Architecure-review.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes-review.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Overview-20180308.pptx
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Updated] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-06 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-2884:

Attachment: TRAFODION-2884-Scalable_Name_Space-DesignNotes-review.v1.0.pdf
TRAFODION-2884-Scalable_Name_Space-DesignNotes.v1.0.pdf

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: 
> TRAFODION-2884-Scalable_Name_Space-Architecure-review.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes-review.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v1.0.pdf
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Updated] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-06 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-2884:

Attachment: TRAFODION-2884-Scalable_Name_Space-Architecure-review.v2.2.pdf
TRAFODION-2884-Scalable_Name_Space-Architecure.v2.2.pdf

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: 
> TRAFODION-2884-Scalable_Name_Space-Architecure-review.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.2.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes-review.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v1.0.pdf
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Commented] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-01 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382654#comment-16382654
 ] 

Gonzalo E Correa commented on TRAFODION-2884:
-

Updated the architecture document with the death registration and delivery changes.

Will follow up with a design notes document that reflects the architecture.

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v2.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Commented] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-03-01 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382648#comment-16382648
 ] 

Gonzalo E Correa commented on TRAFODION-2884:
-

Great suggestion!

This simplifies the design and functionality of the TRAFNMSVR and makes further 
use of the monitor-2-monitor point-2-point communication.

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Resolved] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-03-01 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa resolved TRAFODION-2883.
-
Resolution: Done

> Preliminary Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2883
> URL: https://issues.apache.org/jira/browse/TRAFODION-2883
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
>
> Initial changes required to:
>   - AGENT mode monitor
>       o Preliminary change to remove dependency on OpenMPI during 
> initialization of operational cluster by creating a cluster
>           of one node (MASTER monitor) where other remote nodes (SLAVE 
> monitors) join the cluster through the MASTER
>  - MASTER monitor selection
>  - Scale bug fixes found when creating clusters greater than 120 nodes





[jira] [Commented] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-27 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379654#comment-16379654
 ] 

Gonzalo E Correa commented on TRAFODION-2883:
-

I have created the pull request which will need a code review before it can be 
merged.

[https://github.com/apache/trafodion/pull/1457]

These changes include the ability to run the monitor processes in AGENT mode 
from a Python installation, plus several other scale-related changes and bug 
fixes.

To enable AGENT mode, uncomment the following environment variables in 
sqenvcom.sh and copy the file to all nodes.
{panel:title=sqenvcom.sh }
# Monitor process creator:
#   MPIRUN - monitor process is created by mpirun
# Uncomment SQ_MON_CREATOR when running monitor in AGENT mode
#export SQ_MON_CREATOR=MPIRUN

# Monitor process run mode:
#   AGENT - monitor process runs in agent mode versus MPI collective
# Uncomment the three environment variables below
#export SQ_MON_RUN_MODE=AGENT
#export MONITOR_COMM_PORT=23399
#export MONITOR_SYNC_PORT=23398
{panel}
An alternative to the above is to add the following to sql/scripts/shell.env:

SQ_MON_CREATOR=MPIRUN
SQ_MON_RUN_MODE=AGENT
MONITOR_COMM_PORT=23399
MONITOR_SYNC_PORT=23398
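
As a quick sanity check before starting an instance, the AGENT mode settings listed above could be validated with something like the sketch below. This is not part of Trafodion: check_agent_mode_env is a hypothetical helper, and the rule that the two ports must differ is an assumption, not something the pull request states.

```python
# Hypothetical helper, not part of Trafodion. It only checks the three
# AGENT-mode variables named in the comment above.
PORT_VARS = ("MONITOR_COMM_PORT", "MONITOR_SYNC_PORT")


def check_agent_mode_env(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    if env.get("SQ_MON_RUN_MODE") != "AGENT":
        problems.append("SQ_MON_RUN_MODE is not set to AGENT")

    ports = {}
    for name in PORT_VARS:
        value = env.get(name, "")
        try:
            port = int(value)
        except ValueError:
            port = 0  # treated as invalid below
        if 1 <= port <= 65535:
            ports[name] = port
        else:
            problems.append(f"{name} is missing or not a valid TCP port: {value!r}")

    # Assumption: the two monitor ports must not collide on one host.
    if len(ports) == 2 and len(set(ports.values())) == 1:
        problems.append("MONITOR_COMM_PORT and MONITOR_SYNC_PORT must differ")
    return problems
```

Calling check_agent_mode_env(dict(os.environ)) after sourcing the environment returns an empty list when the three variables look usable.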

To enable monitor tracing in AGENT mode, uncomment the desired trace level in 
sql/scripts/monitor.env.

Once this is merged to the baseline, I will merge these changes up to the 
shared TRAFODION-2884 branch in the zcorrea_fork.

> Preliminary Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2883
> URL: https://issues.apache.org/jira/browse/TRAFODION-2883
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
>
> Initial changes required to:
>   - AGENT mode monitor
>       o Preliminary change to remove dependency on OpenMPI during 
> initialization of operational cluster by creating a cluster
>           of one node (MASTER monitor) where other remote nodes (SLAVE 
> monitors) join the cluster through the MASTER
>  - MASTER monitor selection
>  - Scale bug fixes found when creating clusters greater than 120 nodes





[jira] [Updated] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-27 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-2883:

Description: 
Initial changes required to:

  - AGENT mode monitor

      o Preliminary change to remove dependency on OpenMPI during 
initialization of operational cluster by creating a cluster
          of one node (MASTER monitor) where other remote nodes (SLAVE 
monitors) join the cluster through the MASTER

 - MASTER monitor selection

 - Scale bug fixes found when creating clusters greater than 120 nodes

  was:
Initial changes required to:

 - Increase the size of Trafodion instance from 256 servers to 1024 servers

 - AGENT mode monitor

      o Preliminary change to remove dependency on OpenMPI during 
initialization of operational cluster by creating a cluster
          of one node (MASTER monitor) where other remote nodes (SLAVE 
monitors) join the cluster through the MASTER

 - MASTER monitor selection

 - Scale bug fixes found when creating clusters greater than 120 nodes


> Preliminary Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2883
> URL: https://issues.apache.org/jira/browse/TRAFODION-2883
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
>
> Initial changes required to:
>   - AGENT mode monitor
>       o Preliminary change to remove dependency on OpenMPI during 
> initialization of operational cluster by creating a cluster
>           of one node (MASTER monitor) where other remote nodes (SLAVE 
> monitors) join the cluster through the MASTER
>  - MASTER monitor selection
>  - Scale bug fixes found when creating clusters greater than 120 nodes





[jira] [Updated] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-02-27 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-2883:

Description: 
Initial changes required to:

 - Increase the size of Trafodion instance from 256 servers to 1024 servers

 - AGENT mode monitor

      o Preliminary change to remove dependency on OpenMPI during 
initialization of operational cluster by creating a cluster
         of one node (MASTER monitor) where other remote nodes (SLAVE monitors) 
join the cluster through the MASTER

 - MASTER monitor selection

 - Scale bug fixes found when creating clusters greater than 120 nodes

  was:Initial changes required to increase the size of Trafodion instance from 
256 servers to 1024 servers.


> Preliminary Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2883
> URL: https://issues.apache.org/jira/browse/TRAFODION-2883
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
>
> Initial changes required to:
>  - Increase the size of Trafodion instance from 256 servers to 1024 servers
>  - AGENT mode monitor
>       o Preliminary change to remove dependency on OpenMPI during 
> initialization of operational cluster by creating a cluster
>          of one node (MASTER monitor) where other remote nodes (SLAVE 
> monitors) join the cluster through the MASTER
>  - MASTER monitor selection
>  - Scale bug fixes found when creating clusters greater than 120 nodes





[jira] [Commented] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-02-06 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354344#comment-16354344
 ] 

Gonzalo E Correa commented on TRAFODION-2884:
-

Attached are preliminary architecture and design documents. 

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Updated] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-02-06 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-2884:

Attachment: TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf

> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
> Attachments: TRAFODION-2884-Scalable_Name_Space-Architecure.v1.0.pdf, 
> TRAFODION-2884-Scalable_Name_Space-Architecure.v1.1.pdf, 
> TRAFODION-2884-Scalable_Name_Space-DesignNotes.v0.1.pdf
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Resolved] (TRAFODION-2881) Multiple node failures occur during HA testing

2018-02-01 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa resolved TRAFODION-2881.
-
Resolution: Fixed

Changes in: https://github.com/apache/trafodion/pull/1392

> Multiple node failures occur during HA testing
> --
>
> Key: TRAFODION-2881
> URL: https://issues.apache.org/jira/browse/TRAFODION-2881
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
>Priority: Major
> Fix For: 2.3
>
>
> Inflicting a server failure in certain modes will cause multiple monitor 
> processes to also bring their nodes down along with the intended target of 
> the test.
> Server down modes:
> init 6
> reboot -f
> shutdown -r now
> shell node down command
> In addition, after a server down, the shell 'node up' command will also fail 
> intermittently. This requires a longevity HA test to down and up nodes over a 
> long period of time like 24-48 hours.





[jira] [Created] (TRAFODION-2907) Remove usage of TRAF_EXCLUDE_LIST

2018-01-12 Thread Gonzalo E Correa (JIRA)
Gonzalo E Correa created TRAFODION-2907:
---

 Summary: Remove usage of  TRAF_EXCLUDE_LIST
 Key: TRAFODION-2907
 URL: https://issues.apache.org/jira/browse/TRAFODION-2907
 Project: Apache Trafodion
  Issue Type: Task
  Components: foundation
Affects Versions: 2.3
Reporter: Gonzalo E Correa
Assignee: Gonzalo E Correa
Priority: Minor
 Fix For: 2.3


The usage of TRAF_EXCLUDE_LIST nodes is obsolete since the introduction of 
TRAFODION-2001.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Work started] (TRAFODION-2884) Trafodion Foundation Scalability Enhancements

2018-01-11 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on TRAFODION-2884 started by Gonzalo E Correa.
---
> Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2884
> URL: https://issues.apache.org/jira/browse/TRAFODION-2884
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Architectural changes are needed in the Trafodion Foundation components to 
> effectively scale above 256 servers.





[jira] [Work started] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-01-11 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on TRAFODION-2883 started by Gonzalo E Correa.
---
> Preliminary Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2883
> URL: https://issues.apache.org/jira/browse/TRAFODION-2883
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Initial changes required to increase the size of Trafodion instance from 256 
> servers to 1024 servers.





[jira] [Updated] (TRAFODION-2883) Preliminary Trafodion Foundation Scalability Enhancements

2018-01-08 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa updated TRAFODION-2883:

Summary: Preliminary Trafodion Foundation Scalability Enhancements  (was: 
Preliminary Trafodion Scalability Enhancements)

> Preliminary Trafodion Foundation Scalability Enhancements
> -
>
> Key: TRAFODION-2883
> URL: https://issues.apache.org/jira/browse/TRAFODION-2883
> Project: Apache Trafodion
>  Issue Type: Improvement
>  Components: dtm, foundation, installer
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Initial changes required to increase the size of Trafodion instance from 256 
> servers to 1024 servers.





[jira] [Created] (TRAFODION-2883) Preliminary Trafodion Scalability Enhancements

2018-01-04 Thread Gonzalo E Correa (JIRA)
Gonzalo E Correa created TRAFODION-2883:
---

 Summary: Preliminary Trafodion Scalability Enhancements
 Key: TRAFODION-2883
 URL: https://issues.apache.org/jira/browse/TRAFODION-2883
 Project: Apache Trafodion
  Issue Type: Improvement
  Components: dtm, foundation, installer
Affects Versions: 2.3
Reporter: Gonzalo E Correa
Assignee: Gonzalo E Correa
 Fix For: 2.3


Initial changes required to increase the size of Trafodion instance from 256 
servers to 1024 servers.





[jira] [Created] (TRAFODION-2882) Foundation infrastructure changes needed to support operating in Cloudera Manager environment

2018-01-04 Thread Gonzalo E Correa (JIRA)
Gonzalo E Correa created TRAFODION-2882:
---

 Summary: Foundation infrastructure changes needed to support 
operating in Cloudera Manager environment
 Key: TRAFODION-2882
 URL: https://issues.apache.org/jira/browse/TRAFODION-2882
 Project: Apache Trafodion
  Issue Type: Improvement
  Components: foundation
Affects Versions: 2.3
Reporter: Gonzalo E Correa
Assignee: Gonzalo E Correa
 Fix For: 2.3


The method for starting a Trafodion instance is based on Open MPI. A different 
method is needed to remove this dependency and to allow for larger cluster 
configuration installations.

This calls for a different method of instantiating a Trafodion cluster instance 
which utilizes existing node reintegration, i.e., node up, capability and is 
not dependent on Open MPI.





[jira] [Created] (TRAFODION-2881) Multiple node failures occur during HA testing

2018-01-04 Thread Gonzalo E Correa (JIRA)
Gonzalo E Correa created TRAFODION-2881:
---

 Summary: Multiple node failures occur during HA testing
 Key: TRAFODION-2881
 URL: https://issues.apache.org/jira/browse/TRAFODION-2881
 Project: Apache Trafodion
  Issue Type: Bug
  Components: foundation
Affects Versions: 2.3
Reporter: Gonzalo E Correa
Assignee: Gonzalo E Correa
 Fix For: 2.3


Inflicting a server failure in certain modes will cause multiple monitor 
processes to also bring their nodes down along with the intended target of the 
test.

Server down modes:

init 6
reboot -f
shutdown -r now
shell node down command

In addition, after a server down, the shell 'node up' command will also fail 
intermittently. This requires a longevity HA test to down and up nodes over a 
long period of time like 24-48 hours.
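
The longevity test described above could be driven by a loop along these lines. This is only a sketch under stated assumptions: node_down and node_up are hypothetical callables that wrap whichever down method is under test (for example the shell node down and node up commands) and return True on success; the function name, the pause, and the round-robin node choice are illustrative and not taken from any existing Trafodion test.

```python
import time


def run_longevity_test(node_down, node_up, nodes, duration_s, pause_s=0.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Cycle nodes down and up until duration_s elapses.

    node_down / node_up are hypothetical callables taking a node name and
    returning True on success. Returns (cycles_run, failures), where each
    failure is a (cycle, node, action) tuple.
    """
    failures = []
    deadline = clock() + duration_s
    cycle = 0
    while clock() < deadline:
        node = nodes[cycle % len(nodes)]  # round-robin over the node list
        for action, step in (("down", node_down), ("up", node_up)):
            if not step(node):
                failures.append((cycle, node, action))
        cycle += 1
        sleep(pause_s)  # give the cluster time to settle between cycles
    return cycle, failures
```

For a 24 hour run, duration_s would be 24 * 3600; intermittent 'node up' failures then show up as ("up") entries in the returned failure list.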






[jira] [Commented] (TRAFODION-2664) Instance will be down when the zookeeper on name node has been down

2018-01-02 Thread Gonzalo E Correa (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308899#comment-16308899
 ] 

Gonzalo E Correa commented on TRAFODION-2664:
-

This issue was fixed and should be marked resolved as of Release 2.2.

> Instance will be down when the zookeeper on name node has been down
> ---
>
> Key: TRAFODION-2664
> URL: https://issues.apache.org/jira/browse/TRAFODION-2664
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.2-incubating
> Environment: Test Environment:
> CDH5.4.8: 10.10.23.19:7180, total 6 nodes.
> HDFS-HA and DCS-HA: enabled
> OS: CentOS 6.8, physical machine.
> SW Build: R2.2.3 (EsgynDB_Enterprise Release 2.2.3 (Build release [sbroeder], 
> branch 1ce8d39-xdc_nari, date 11Jun17)
>Reporter: Jarek
>Assignee: Gonzalo E Correa
>Priority: Critical
>  Labels: build
> Fix For: 2.2-incubating
>
>
> Description: The instance goes down when the ZooKeeper server on the name 
> node is stopped.
> Test Steps:
> Step 1. Start OE and 4 long queries with trafci on the first node, 
> esggy-clu-n010.
> Step 2. Wait several minutes, then stop ZooKeeper on the name node, 
> esggy-clu-n010, in the Cloudera Manager page.
> Step 3. With trafci, run a basic query and the 4 long queries again.
> In Step 3, the whole instance goes down after a while. I tried this test 
> scenario several times, and the instance went down every time.
> Timestamp:
> Test Start Time: 20170616132939
> Test End  Time: 20170616134350
> Stop zookeeper on name node of node esggy-clu-n010: 20170616133344
> Check logs:
> 1) Each node displays the following error:
> 2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , Process 
> Name: $MONITOR,,, TID: 5429, Message ID: 101371801, 
> [CZClient::IsZNodeExpired], zoo_exists() for 
> /trafodion/instance/cluster/esggy-clu-n010.esgyn.cn failed with error 
> ZCONNECTIONLOSS
> 2) Zookeeper displays:
> ls /trafodion/instance/cluster
> []
> So it seems the zclient has been lost on each node.
> Location of logs:
> esggy-clu-n010: 
> /data4/jarek/ha.interactive/trafodion_and_cluster_logs/cluster_logs.20170616134816.tar.gz
>  and trafodion_logs.20170616134816.tar.gz
> The logs exceed the attachment size limit, so I cannot upload them to this 
> JIRA issue.
> How many ZooKeeper quorum servers are in the cluster? 3 in total.
> dcs.zookeeper.quorum: 
> esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn
> How to access the cluster?
> 1. Login 10.10.10.8 from US machine: trafodion/traf123
> 2. Login 10.10.23.19 from 10.10.10.8: trafodion/traf123





[jira] [Resolved] (TRAFODION-2746) Monitor exhibits memory corruption in large cluster configuration > 30 nodes

2018-01-02 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa resolved TRAFODION-2746.
-
Resolution: Fixed

> Monitor exhibits memory corruption in large cluster configuration > 30 nodes
> 
>
> Key: TRAFODION-2746
> URL: https://issues.apache.org/jira/browse/TRAFODION-2746
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.3
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> Found the following problems in the monitor when trying to bring up 120 nodes:
> 1. A segmentation violation occurred during the Integration phase, when 
> the new monitor is establishing the socket communication paths between itself 
> and the existing monitors.
> 2. A second segmentation violation was due to a buffer overwrite during 
> the Joining (revive) phase.
> 3. One of the monitors would remain in the Joining state and never come out 
> of it.
> 4. Stderr buffer overwrite in CRedirectStderr::handleOutput()





[jira] [Resolved] (TRAFODION-2651) The monitor to monitor process communication cannot handle a network reset

2018-01-02 Thread Gonzalo E Correa (JIRA)

 [ 
https://issues.apache.org/jira/browse/TRAFODION-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gonzalo E Correa resolved TRAFODION-2651.
-
Resolution: Fixed

> The monitor to monitor process communication cannot handle a network reset 
> ---
>
> Key: TRAFODION-2651
> URL: https://issues.apache.org/jira/browse/TRAFODION-2651
> Project: Apache Trafodion
>  Issue Type: Bug
>  Components: foundation
>Affects Versions: 2.2-incubating
>Reporter: Gonzalo E Correa
>Assignee: Gonzalo E Correa
> Fix For: 2.3
>
>
> The monitor to monitor socket communication does not have reconnect logic to 
> handle a network reset or transient network errors.
> Analysis:
>  - During a ~20 second network reset window, no errors are detected by 
> open sockets
>       o Open sockets are dead, but there is no indication from the TCP/IP 
> stack that the socket is in an error condition
>  - Once the network is restored, a CONNECTIONLOSS is reported by the 
> Zookeeper Client Library.
>       o However, reconnect logic reestablishes the connection with the quorum.
>  - At EPOLL expiration time, the EPOLL logic reports “Not heard from peer=n” 
> and treats the peer as Node Down.
>       o The node down logic deletes the corresponding znode, 
> CZClient::WatchNodeDelete()
>       o All monitor processes continually check for expired znodes for each 
> node in the cluster, including their own znode
>           - An expired znode is handled as a down node
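
For illustration only, the kind of reconnect logic the issue says is missing might look like the sketch below: a bounded number of connection attempts, each with its own timeout, and exponential backoff between attempts so a short network reset is ridden out before the peer is declared down. This is not the monitor's actual fix (the monitor is not written in Python); connect_with_retries and its parameters are hypothetical names, and the attempt count, timeout, and backoff values are assumptions.

```python
import socket
import time


def connect_with_retries(host, port, attempts=3, timeout=2.0, backoff=0.5,
                         connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection to a peer, retrying transient failures.

    Hypothetical sketch: each attempt is bounded by `timeout` seconds, and
    the wait between attempts doubles each time (backoff, 2*backoff, ...).
    The `connect` and `sleep` parameters exist so the logic can be tested
    without a real network.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=timeout)
        except OSError as exc:  # covers timeouts, resets, and refusals
            last_error = exc
            if attempt < attempts - 1:
                sleep(backoff * (2 ** attempt))
    # Only after every attempt has failed is the peer reported unreachable.
    raise ConnectionError(
        f"peer {host}:{port} unreachable after {attempts} attempts"
    ) from last_error
```

The point of the design is that a single failed or silent socket is not treated as a node failure; only exhausting all retries is.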


