[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388988#comment-16388988
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172735971
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
@@ -262,12 +308,38 @@ public void run() {
 future.get();// block
 
 } catch (Exception e) {
-LOG.error(e);
-e.printStackTrace();
+LOG.error(e.getMessage(), e);
+try {
+FloatingIp floatingIp = FloatingIp.getInstance(this);
+floatingIp.unbindScript();
+} catch (Exception e1) {
+if (LOG.isErrorEnabled()) {
+LOG.error("Error creating class FloatingIp [" + 
e.getMessage() + "]", e1);
+}
+}
 if (pool != null)
 pool.shutdown();
-System.exit(0);
+if (ls != null) {
+ListenerWorker lw = ls.getWorker();
+if (lw != null) {
+lw.interrupt();
+LOG.info("Interrupt listenerWorker.");
+}
+ls.interrupt();
+LOG.info("Interrupt listenerService.");
+}
+if (infoServer != null) {
+try {
+infoServer.stop();
+LOG.info("Stop infoServer.");
+} catch (Exception e1) {
+LOG.error(e1.getMessage(), e1);
+}
+}
+return 1;
+
--- End diff --

Add try-catch for each progress.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388986#comment-16388986
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172729543
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
@@ -111,11 +104,59 @@ public DcsMaster(String[] args) {
 trafodionHome = System.getProperty(Constants.DCS_TRAFODION_HOME);
 jvmShutdownHook = new JVMShutdownHook();
 Runtime.getRuntime().addShutdownHook(jvmShutdownHook);
-thrd = new Thread(this);
-thrd.start();
+
+ExecutorService executorService = Executors.newFixedThreadPool(1);
+CompletionService completionService = new 
ExecutorCompletionService(executorService);
+
+while (true) {
+completionService.submit(this);
+Future f = null;
+try {
+f = completionService.take();
+if (f != null) {
+Integer status = f.get();
+if (status <= 0) {
+System.exit(status);
+} else {
+// 35000 * 15mins ~= 1 years
+RetryCounter retryCounter = 
RetryCounterFactory.create(35000, 15, TimeUnit.MINUTES);
+while (true) {
+try {
+ZkClient tmpZkc = new ZkClient();
+tmpZkc.connect();
+tmpZkc.close();
+tmpZkc = null;
+LOG.info("Connected to ZooKeeper 
successful, restart DCS Master.");
+// reset lock
+isLeader = new CountDownLatch(1);
+break;
--- End diff --

As we discussed, Zookeeper connection lost has been covered by session 
expired event, so this loop is useless.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388995#comment-16388995
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172737467
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/zookeeper/ZkClient.java ---
@@ -176,7 +208,19 @@ public ZooKeeper getZk() {
public void process(WatchedEvent event) {
if(event.getState() == Watcher.Event.KeeperState.SyncConnected) 
{
connectedSignal.countDown();
-   }
+   } else if (event.getState() == 
Watcher.Event.KeeperState.Expired) {
+LOG.info("session expired. now rebuilding");
+// session expired, may be never happending. but if it happen 
there
+// need to close old client and rebuild new client
+try {
+connect(true);
--- End diff --

In connect(...), I saw it was catch for zookeeper exception, how can get 
exception here?


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388994#comment-16388994
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172736627
  
--- Diff: 
dcs/src/main/java/org/trafodion/dcs/master/listener/ListenerWorker.java ---
@@ -93,6 +93,7 @@ public void run() {
 try {
 queue.wait();
 } catch (InterruptedException e) {
+return;
--- End diff --

Log message for what happened by debug level.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388997#comment-16388997
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172736393
  
--- Diff: 
dcs/src/main/java/org/trafodion/dcs/master/listener/ListenerService.java ---
@@ -263,14 +264,20 @@ else if ((key.interestOps() & SelectionKey.OP_WRITE) 
== SelectionKey.OP_WRITE) {
 }
 }
 }
+if (this.isInterrupted()) {
+throw new InterruptedException();
+}
 //gc();
 }
+} catch (InterruptedException e) {
--- End diff --

Log what happens by info level at least.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388990#comment-16388990
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172736542
  
--- Diff: 
dcs/src/main/java/org/trafodion/dcs/master/listener/ListenerService.java ---
@@ -263,14 +264,20 @@ else if ((key.interestOps() & SelectionKey.OP_WRITE) 
== SelectionKey.OP_WRITE) {
 }
 }
 }
+if (this.isInterrupted()) {
+throw new InterruptedException();
+}
 //gc();
 }
+} catch (InterruptedException e) {
 } catch (IOException e) {
 LOG.error(e);
 System.exit(1);
 } finally {
+LOG.info("close ServerSocketChannel...");
 if (server != null) {
 try {
+server.socket().close();
 server.close();
 } catch (IOException e) {
 e.printStackTrace();
--- End diff --

it's better to print the message into log file.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388999#comment-16388999
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172736968
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/zookeeper/ZkClient.java ---
@@ -151,8 +163,28 @@ public void connect() throws IOException, 
InterruptedException {
this.zk=null;
throw new IOException("Cannot connect to 
Zookeeper");
}
-   
-   LOG.debug("Zookeeper.State=" + this.zk.getState());
+
+// Solve the forcible reconnection
+// When zk reconn, the backup-master may take over the master,
+// so current master should restart, and queues in 
/dcs/master/leader
+if (LOG.isDebugEnabled()) {
+LOG.debug("force = [" + force + "]. checkPath = [" + 
checkPath + "]");
+}
+if (force && checkPath != null) {
+try {
+Stat stat = zk.exists(checkPath, false);
+if (LOG.isDebugEnabled()) {
+LOG.debug("stat = [" + stat + "].");
+}
+if (stat == null) {
+// this means master has change.
+setSessionRecoverSuccessful(false);
+}
+} catch (KeeperException e) {
+e.printStackTrace();
--- End diff --

Log messages in file.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388998#comment-16388998
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172736451
  
--- Diff: 
dcs/src/main/java/org/trafodion/dcs/master/listener/ListenerService.java ---
@@ -263,14 +264,20 @@ else if ((key.interestOps() & SelectionKey.OP_WRITE) 
== SelectionKey.OP_WRITE) {
 }
 }
 }
+if (this.isInterrupted()) {
+throw new InterruptedException();
+}
 //gc();
 }
+} catch (InterruptedException e) {
 } catch (IOException e) {
 LOG.error(e);
--- End diff --

it's better to use 2 parameters API for LOG.error


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388989#comment-16388989
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172733924
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
@@ -129,19 +170,19 @@ public void run() {
 instance = "1";
 } catch (NullPointerException e) {
 LOG.error("No args found: ", e);
-System.exit(1);
+return -1;
 } catch (ParseException e) {
 LOG.error("Could not parse: ", e);
-System.exit(1);
+return -1;
 }
 
 try {
 zkc = new ZkClient();
 zkc.connect();
 LOG.info("Connected to ZooKeeper");
 } catch (Exception e) {
-LOG.error(e);
-System.exit(1);
+LOG.error(e.getMessage(), e);
+return 1;
--- End diff --

Close zk connection while any connection exception happens. it's better to 
use specific Exception for zookeeper.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388991#comment-16388991
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172737025
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/zookeeper/ZkClient.java ---
@@ -176,7 +208,19 @@ public ZooKeeper getZk() {
public void process(WatchedEvent event) {
if(event.getState() == Watcher.Event.KeeperState.SyncConnected) 
{
connectedSignal.countDown();
-   }
+   } else if (event.getState() == 
Watcher.Event.KeeperState.Expired) {
+LOG.info("session expired. now rebuilding");
+// session expired, may be never happending. but if it happen 
there
+// need to close old client and rebuild new client
+try {
+connect(true);
+} catch (IOException e) {
+setSessionRecoverSuccessful(false);
+LOG.error("session expired and throw IOException while do 
reconnect: " + e.getMessage());
--- End diff --

Use 2 params API for log. If it doesn't matter for whatever the message is, 
it should be LOG.warn.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388993#comment-16388993
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172736344
  
--- Diff: 
dcs/src/main/java/org/trafodion/dcs/master/listener/ListenerService.java ---
@@ -263,14 +264,20 @@ else if ((key.interestOps() & SelectionKey.OP_WRITE) 
== SelectionKey.OP_WRITE) {
 }
 }
 }
+if (this.isInterrupted()) {
+throw new InterruptedException();
--- End diff --

Add some specific message for this exception.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388996#comment-16388996
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172737047
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/zookeeper/ZkClient.java ---
@@ -176,7 +208,19 @@ public ZooKeeper getZk() {
public void process(WatchedEvent event) {
if(event.getState() == Watcher.Event.KeeperState.SyncConnected) 
{
connectedSignal.countDown();
-   }
+   } else if (event.getState() == 
Watcher.Event.KeeperState.Expired) {
+LOG.info("session expired. now rebuilding");
+// session expired, may be never happending. but if it happen 
there
+// need to close old client and rebuild new client
+try {
+connect(true);
+} catch (IOException e) {
+setSessionRecoverSuccessful(false);
+LOG.error("session expired and throw IOException while do 
reconnect: " + e.getMessage());
+} catch (InterruptedException e) {
+LOG.error("session expired and throw InterruptedException 
while do reconnect: " + e.getMessage());
--- End diff --

The same as above.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388987#comment-16388987
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172734228
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
@@ -203,8 +244,8 @@ public void run() {
 } catch (KeeperException.NodeExistsException e) {
 // do nothing...some other server has created znodes
--- End diff --

it should be better to add warning for this exception that we could what 
happened. Any events happened should be reflected in log files.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388992#comment-16388992
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user kevinxu021 commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172736094
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/FloatingIp.java ---
@@ -61,6 +61,43 @@ public boolean isEnabled() {
 return isEnabled;
 }
 
+public synchronized int unbindScript() throws Exception {
+if (isEnabled)
+LOG.info("Floating IP is enabled");
+else {
+LOG.info("Floating IP is disabled");
+return 0;
+}
+
+ScriptContext scriptContext = new ScriptContext();
+scriptContext.setScriptName(Constants.SYS_SHELL_SCRIPT_NAME);
+scriptContext.setStripStdOut(false);
+scriptContext.setStripStdErr(false);
+
+String command = 
master.getConfiguration().get(Constants.DCS_MASTER_FLOATING_IP_COMMAND_UNBIND,
+Constants.DEFAULT_DCS_MASTER_FLOATING_IP_COMMAND_UNBIND);
+
+scriptContext.setCommand(command);
+LOG.info("Unbind Floating IP [" + scriptContext.getCommand() + 
"]");
+ScriptManager.getInstance().runScript(scriptContext);// Blocking 
call
+
+StringBuilder sb = new StringBuilder();
+sb.append("exit code [" + scriptContext.getExitCode() + "]");
+if (!scriptContext.getStdOut().toString().isEmpty())
+sb.append(", stdout [" + scriptContext.getStdOut().toString() 
+ "]");
+if (!scriptContext.getStdErr().toString().isEmpty())
+sb.append(", stderr [" + scriptContext.getStdErr().toString() 
+ "]");
+if (LOG.isErrorEnabled())
+LOG.info(sb.toString());
+
+if (scriptContext.getExitCode() == 0)
+LOG.info("Unbind Floating IP successful");
+else
+LOG.error("Unbind Floating IP failed");
--- End diff --

It's better to print exit code.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387540#comment-16387540
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user mashengchen commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r172462100
  
--- Diff: dcs/bin/scripts/dcsunbind.sh ---
@@ -51,6 +51,31 @@ function check_node {
 done
 }
 
+function check_self_node {
+for myinterface in `/sbin/ip link show|cut -d: -f1- | cut -c1- | awk 
-F': ' '/^[0-9]+:.*/ {print $2;}'`; do
+ip_output=$(/sbin/ip addr show $myinterface | cut -d: -f1- | cut 
-c1-)
+
+myifport=`echo "$ip_output" | grep -w $gv_float_external_ip`
+status=$?
+if [ $status -eq 0 ]; then
+tempinterface=`echo $gv_float_interface:$gv_port`
+# check if another interface is bound to this virtual ip 
address
+echo "$myifport" | grep "$tempinterface"  > /dev/null
+if [ $? -eq 0 ]; then
+unbindip=`echo "$myifport" | awk '{print $2}'`
+unbindlb=`echo "$myifport"|awk '{print $NF}'`
+echo "Virtual ip $gv_float_external_ip is in use on node 
$HOSTNAME bound to interface $myinterface($unbindlb) - unbinding..."
+sudo /sbin/ip addr del $unbindip dev $myinterface
+status=$?
--- End diff --

this function is a little different from the one using PDSH,  the ouput of 
`$SQ_PDSH $MY_NODES /sbin/ip addr show` is different from `/sbin/ip link show` 
, so the cut step after the " | " (pipe) is different. 


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366346#comment-16366346
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user svarnau commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r168625988
  
--- Diff: dcs/bin/scripts/dcsunbind.sh ---
@@ -51,6 +51,31 @@ function check_node {
 done
 }
 
+function check_self_node {
+for myinterface in `/sbin/ip link show|cut -d: -f1- | cut -c1- | awk 
-F': ' '/^[0-9]+:.*/ {print $2;}'`; do
+ip_output=$(/sbin/ip addr show $myinterface | cut -d: -f1- | cut 
-c1-)
+
+myifport=`echo "$ip_output" | grep -w $gv_float_external_ip`
+status=$?
+if [ $status -eq 0 ]; then
+tempinterface=`echo $gv_float_interface:$gv_port`
+# check if another interface is bound to this virtual ip 
address
+echo "$myifport" | grep "$tempinterface"  > /dev/null
+if [ $? -eq 0 ]; then
+unbindip=`echo "$myifport" | awk '{print $2}'`
+unbindlb=`echo "$myifport"|awk '{print $NF}'`
+echo "Virtual ip $gv_float_external_ip is in use on node 
$HOSTNAME bound to interface $myinterface($unbindlb) - unbinding..."
+sudo /sbin/ip addr del $unbindip dev $myinterface
+status=$?
--- End diff --

It seems unnecessary to duplicate entire function just to change whether 
commands run locally or remotely. That can be done with a variable, so that we 
don't have two copies of this code to maintain.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-02-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354877#comment-16354877
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user mashengchen commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r166504107
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
@@ -111,11 +104,59 @@ public DcsMaster(String[] args) {
 trafodionHome = System.getProperty(Constants.DCS_TRAFODION_HOME);
 jvmShutdownHook = new JVMShutdownHook();
 Runtime.getRuntime().addShutdownHook(jvmShutdownHook);
-thrd = new Thread(this);
-thrd.start();
+
+ExecutorService executorService = Executors.newFixedThreadPool(1);
+CompletionService completionService = new 
ExecutorCompletionService(executorService);
+
+while (true) {
+completionService.submit(this);
+Future f = null;
+try {
+f = completionService.take();
+if (f != null) {
+Integer status = f.get();
+if (status <= 0) {
+System.exit(status);
+} else {
+// 35000 * 15mins ~= 1 years
+RetryCounter retryCounter = 
RetryCounterFactory.create(35000, 15, TimeUnit.MINUTES);
+while (true) {
+try {
+ZkClient tmpZkc = new ZkClient();
+tmpZkc.connect();
+tmpZkc.close();
+tmpZkc = null;
+LOG.info("Connected to ZooKeeper 
successful, restart DCS Master.");
+// reset lock
+isLeader = new CountDownLatch(1);
+break;
--- End diff --

this logic is for when dcsmaster return with network erro situation.
in the logic , it will try to connect to zk
if it can't conn ( tmpZkc.connect(); ) , there will in catch block and do 
retry
if it connect to zk, then dcsmaster will run call() method again, in the 
time dcsmaster rework ,there must hava another backup-master working ( because 
there must one dcs master work and current master lose network ,then 
backupmaster take over the role) ,  so when dcsmaster rework , it will set 
value in zk /rootpath/dcs/leader/ then hang by lock "isLeader = new 
CountDownLatch(1);"


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-02-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354514#comment-16354514
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/trafodion/pull/1427#discussion_r166434382
  
--- Diff: dcs/src/main/java/org/trafodion/dcs/master/DcsMaster.java ---
@@ -111,11 +104,59 @@ public DcsMaster(String[] args) {
 trafodionHome = System.getProperty(Constants.DCS_TRAFODION_HOME);
 jvmShutdownHook = new JVMShutdownHook();
 Runtime.getRuntime().addShutdownHook(jvmShutdownHook);
-thrd = new Thread(this);
-thrd.start();
+
+ExecutorService executorService = Executors.newFixedThreadPool(1);
+CompletionService completionService = new 
ExecutorCompletionService(executorService);
+
+while (true) {
+completionService.submit(this);
+Future f = null;
+try {
+f = completionService.take();
+if (f != null) {
+Integer status = f.get();
+if (status <= 0) {
+System.exit(status);
+} else {
+// 35000 * 15mins ~= 1 years
+RetryCounter retryCounter = 
RetryCounterFactory.create(35000, 15, TimeUnit.MINUTES);
+while (true) {
+try {
+ZkClient tmpZkc = new ZkClient();
+tmpZkc.connect();
+tmpZkc.close();
+tmpZkc = null;
+LOG.info("Connected to ZooKeeper 
successful, restart DCS Master.");
+// reset lock
+isLeader = new CountDownLatch(1);
+break;
--- End diff --

I'm not sure I understand this logic. Do we sit inside one of these method 
calls during normal processing? Or does tmpZkc.connect() and close() complete 
immediately? If so, it looks like we just loop around the while loop and do it 
again, over and over.


> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TRAFODION-2940) In HA env, one node lose network, when recover, trafci can't use

2018-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346593#comment-16346593
 ] 

ASF GitHub Bot commented on TRAFODION-2940:
---

GitHub user mashengchen opened a pull request:

https://github.com/apache/trafodion/pull/1427

TRAFODION-2940 In HA env, one node lose network, when recover, trafci can't 
use

when there loses network for a long time ,and then network recover, there 
will trigger zookeeper session expired, at this time ,check whether current 
dcsmaster is leader, if not unbind this node's floating ip, and make dcsmaster 
init status, then rerun dcs master.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mashengchen/trafodion TRAFODION-2940

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/trafodion/pull/1427.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1427


commit c92bd619a310f6b1b13d4164250ee8c8cb93f1e6
Author: aven 
Date:   2018-01-31T10:33:48Z

TRAFODION-2940 In HA env, one node lose network, when recover, trafci can't 
use




> In HA env, one node lose network, when recover, trafci can't use
> 
>
> Key: TRAFODION-2940
> URL: https://issues.apache.org/jira/browse/TRAFODION-2940
> Project: Apache Trafodion
>  Issue Type: Bug
>Affects Versions: any
>Reporter: mashengchen
>Assignee: mashengchen
>Priority: Major
> Fix For: 2.3
>
>
> In HA env, if one node lose network for a long time , once network recover, 
> there will have two floating ip, two working dcs master, and trafci can't be 
> use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)