[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328139466
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
 ##
 @@ -150,78 +151,106 @@ public void perform() throws Exception { }
   }
 
   protected void killMaster(ServerName server) throws IOException {
-LOG.info("Killing master " + server);
+LOG.info("Killing master {}", server);
 cluster.killMaster(server);
 cluster.waitForMasterToStop(server, killMasterTimeout);
 LOG.info("Killed master " + server);
   }
 
   protected void startMaster(ServerName server) throws IOException {
-LOG.info("Starting master " + server.getHostname());
+LOG.info("Starting master {}", server.getHostname());
 cluster.startMaster(server.getHostname(), server.getPort());
 cluster.waitForActiveAndReadyMaster(startMasterTimeout);
 LOG.info("Started master " + server.getHostname());
   }
 
+  protected void stopRs(ServerName server) throws IOException {
+LOG.info("Stopping regionserver {}", server);
+cluster.stopRegionServer(server);
+cluster.waitForRegionServerToStop(server, killRsTimeout);
 
 Review comment:
   The name might be a bit confusing but let just leave it like this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328031685
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartActionBaseAction.java
 ##
 @@ -50,6 +50,23 @@ void restartMaster(ServerName server, long sleepTime) 
throws IOException {
 startMaster(server);
   }
 
+  /**
+   * Stop and then restart the region server instaedof killing it.
 
 Review comment:
   typo instaedof


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328019811
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/GracefulRollingRestartRsAction.java
 ##
 @@ -0,0 +1,72 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.util.RegionMover;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Gracefully restarts every non-admin regionserver in a rolling fashion. At 
each step, it unloads,
 
 Review comment:
   What is non-admin regionserver?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328019053
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
 ##
 @@ -269,6 +298,19 @@ protected void forceBalancer() throws Exception {
 }
   }
 
+  protected void setBalancer(boolean onOrOff, boolean synchronous) throws 
Exception {
+Admin admin = this.context.getHBaseIntegrationTestingUtility().getAdmin();
+boolean result = false;
+try {
+  result = admin.balancerSwitch(onOrOff, synchronous);
+} catch (Exception e) {
+  LOG.warn("Got exception while switching balance ", e);
+}
+if (!result) {
 
 Review comment:
   admin.balancerSwitch returns the previous state. If you disable balancer 
`result` will be `true` and logs the error.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328021113
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/GracefulRollingRestartRsAction.java
 ##
 @@ -0,0 +1,72 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.util.RegionMover;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Gracefully restarts every non-admin regionserver in a rolling fashion. At 
each step, it unloads,
+ * restarts the loads every rs server sleeping randomly (0-sleepTime) in 
between servers.
+ */
+public class GracefulRollingRestartRsAction extends RestartActionBaseAction {
+  private static final Logger LOG = 
LoggerFactory.getLogger(GracefulRollingRestartRsAction.class);
+
+  public GracefulRollingRestartRsAction(long sleepTime) {
+super(sleepTime);
+  }
+
+  @Override
+  public void perform() throws Exception {
+LOG.info("Performing action: Rolling restarting non-master region 
servers");
+List selectedServers = selectServers();
+
+LOG.info("Disabling balancer to make unloading possible");
+setBalancer(false, false);
+
+for(ServerName server : selectedServers){
 
 Review comment:
   nit: add spaces around ()


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328030997
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/GracefulRollingRestartRsAction.java
 ##
 @@ -0,0 +1,72 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.util.RegionMover;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Gracefully restarts every non-admin regionserver in a rolling fashion. At 
each step, it unloads,
+ * restarts the loads every rs server sleeping randomly (0-sleepTime) in 
between servers.
+ */
+public class GracefulRollingRestartRsAction extends RestartActionBaseAction {
+  private static final Logger LOG = 
LoggerFactory.getLogger(GracefulRollingRestartRsAction.class);
+
+  public GracefulRollingRestartRsAction(long sleepTime) {
+super(sleepTime);
+  }
+
+  @Override
+  public void perform() throws Exception {
+LOG.info("Performing action: Rolling restarting non-master region 
servers");
+List selectedServers = selectServers();
+
+LOG.info("Disabling balancer to make unloading possible");
+setBalancer(false, false);
+
+for(ServerName server : selectedServers){
+  String rsName = server.getAddress().toString();
+  try (RegionMover rm =
+  new RegionMover.RegionMoverBuilder(rsName, 
getConf()).ack(true).build()) {
+LOG.info("Unloading " + server);
+rm.unload();
+LOG.info("Restarting " + server);
+gracefulRestartRs(server, sleepTime);
+LOG.info("Loading " + server);
+rm.load();
+  } catch (org.apache.hadoop.util.Shell.ExitCodeException e) {
 
 Review comment:
   Import the class.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328015214
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
 ##
 @@ -150,78 +151,106 @@ public void perform() throws Exception { }
   }
 
   protected void killMaster(ServerName server) throws IOException {
-LOG.info("Killing master " + server);
+LOG.info("Killing master {}", server);
 cluster.killMaster(server);
 cluster.waitForMasterToStop(server, killMasterTimeout);
 LOG.info("Killed master " + server);
   }
 
   protected void startMaster(ServerName server) throws IOException {
-LOG.info("Starting master " + server.getHostname());
+LOG.info("Starting master {}", server.getHostname());
 cluster.startMaster(server.getHostname(), server.getPort());
 cluster.waitForActiveAndReadyMaster(startMasterTimeout);
 LOG.info("Started master " + server.getHostname());
   }
 
+  protected void stopRs(ServerName server) throws IOException {
+LOG.info("Stopping regionserver {}", server);
+cluster.stopRegionServer(server);
+cluster.waitForRegionServerToStop(server, killRsTimeout);
 
 Review comment:
   Is it good to use killRsTimeout for stop timeout?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328032967
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RestartActionBaseAction.java
 ##
 @@ -50,6 +50,23 @@ void restartMaster(ServerName server, long sleepTime) 
throws IOException {
 startMaster(server);
   }
 
+  /**
+   * Stop and then restart the region server instaedof killing it.
+   * @param server hostname to restart the regionserver on
+   * @param sleepTime number of milliseconds between stop and restart
+   * @throws IOException if something goes wrong
+   */
+  void gracefulRestartRs(ServerName server, long sleepTime) throws IOException 
{
+sleepTime = Math.max(sleepTime, 1000);
+// Don't try the stop if we're stopping already
+if (context.isStopping()) {
+  return;
+}
+stopRs(server);
 
 Review comment:
   Please add logs to see the progress.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328060218
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchSuspendResumeRsAction.java
 ##
 @@ -0,0 +1,116 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Queue;
+
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.chaos.monkies.PolicyBasedChaosMonkey;
+import org.apache.hadoop.hbase.util.Threads;
+import org.apache.hadoop.util.Shell;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Suspend then resume a ratio of the regionservers in a rolling fashion. At 
each step, either
+ * suspend a server, or resume one, sleeping (sleepTime) in between steps. The 
parameter
+ * maxSuspendedServers limits the maximum number of servers that can be down 
at the same time
+ * during rolling restarts.
+ */
+public class RollingBatchSuspendResumeRsAction extends Action {
+  private static final Logger LOG =
+  LoggerFactory.getLogger(RollingBatchSuspendResumeRsAction.class);
+  private float ratio;
+  private long sleepTime;
+  private int maxSuspendedServers; // number of maximum suspended servers at 
any given time.
+
+  public RollingBatchSuspendResumeRsAction(long sleepTime, float ratio) {
+this(sleepTime, ratio, 5);
+  }
+
+  public RollingBatchSuspendResumeRsAction(long sleepTime, float ratio, int 
maxSuspendedServers) {
+this.ratio = ratio;
+this.sleepTime = sleepTime;
+this.maxSuspendedServers = maxSuspendedServers;
+  }
+
+  enum SuspendOrResume {
+SUSPEND, RESUME
+  }
+
+  @Override public void perform() throws Exception {
+LOG.info(String.format("Performing action: Rolling batch restarting %d%% 
of region servers",
+(int) (ratio * 100)));
+List selectedServers = selectServers();
+
+Queue serversToBeSuspended = new LinkedList<>(selectedServers);
+Queue suspendedServers = new LinkedList<>();
+
+// loop while there are servers to be suspended or suspended servers to be 
resumed
+while ((!serversToBeSuspended.isEmpty() || !suspendedServers.isEmpty()) && 
!context
+.isStopping()) {
+  SuspendOrResume action;
+
+  if (serversToBeSuspended.isEmpty()) { // no more servers to suspend
+action = SuspendOrResume.RESUME;
+  } else if (suspendedServers.isEmpty()) {
+action = SuspendOrResume.SUSPEND; // no more servers to resume
+  } else if (suspendedServers.size() >= maxSuspendedServers) {
+// we have too many suspended servers. Don't suspend any more
+action = SuspendOrResume.RESUME;
+  } else {
+// do a coin toss
+action = RandomUtils.nextBoolean() ? SuspendOrResume.SUSPEND : 
SuspendOrResume.RESUME;
+  }
+
+  ServerName server;
+  switch (action) {
+case SUSPEND:
+  server = serversToBeSuspended.remove();
+  try {
+suspendRs(server);
+  } catch (Shell.ExitCodeException e) {
+LOG.info("Problem suspending but presume successful; code=" + 
e.getExitCode(), e);
 
 Review comment:
   Log.warn?
   Use parameterized logging.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328030516
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/GracefulRollingRestartRsAction.java
 ##
 @@ -0,0 +1,72 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.util.RegionMover;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Gracefully restarts every non-admin regionserver in a rolling fashion. At 
each step, it unloads,
+ * restarts the loads every rs server sleeping randomly (0-sleepTime) in 
between servers.
+ */
+public class GracefulRollingRestartRsAction extends RestartActionBaseAction {
+  private static final Logger LOG = 
LoggerFactory.getLogger(GracefulRollingRestartRsAction.class);
+
+  public GracefulRollingRestartRsAction(long sleepTime) {
+super(sleepTime);
+  }
+
+  @Override
+  public void perform() throws Exception {
+LOG.info("Performing action: Rolling restarting non-master region 
servers");
+List selectedServers = selectServers();
+
+LOG.info("Disabling balancer to make unloading possible");
+setBalancer(false, false);
+
+for(ServerName server : selectedServers){
+  String rsName = server.getAddress().toString();
+  try (RegionMover rm =
+  new RegionMover.RegionMoverBuilder(rsName, 
getConf()).ack(true).build()) {
+LOG.info("Unloading " + server);
 
 Review comment:
   nit: parameterized logging


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328058038
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchSuspendResumeRsAction.java
 ##
 @@ -0,0 +1,116 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Queue;
+
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.chaos.monkies.PolicyBasedChaosMonkey;
+import org.apache.hadoop.hbase.util.Threads;
+import org.apache.hadoop.util.Shell;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Suspend then resume a ratio of the regionservers in a rolling fashion. At 
each step, either
+ * suspend a server, or resume one, sleeping (sleepTime) in between steps. The 
parameter
+ * maxSuspendedServers limits the maximum number of servers that can be down 
at the same time
+ * during rolling restarts.
+ */
+public class RollingBatchSuspendResumeRsAction extends Action {
+  private static final Logger LOG =
+  LoggerFactory.getLogger(RollingBatchSuspendResumeRsAction.class);
+  private float ratio;
+  private long sleepTime;
+  private int maxSuspendedServers; // number of maximum suspended servers at 
any given time.
+
+  public RollingBatchSuspendResumeRsAction(long sleepTime, float ratio) {
+this(sleepTime, ratio, 5);
+  }
+
+  public RollingBatchSuspendResumeRsAction(long sleepTime, float ratio, int 
maxSuspendedServers) {
+this.ratio = ratio;
+this.sleepTime = sleepTime;
+this.maxSuspendedServers = maxSuspendedServers;
+  }
+
+  enum SuspendOrResume {
+SUSPEND, RESUME
+  }
+
+  @Override public void perform() throws Exception {
 
 Review comment:
   @Override in new line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328020894
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/GracefulRollingRestartRsAction.java
 ##
 @@ -0,0 +1,72 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.util.RegionMover;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Gracefully restarts every non-admin regionserver in a rolling fashion. At 
each step, it unloads,
+ * restarts the loads every rs server sleeping randomly (0-sleepTime) in 
between servers.
+ */
+public class GracefulRollingRestartRsAction extends RestartActionBaseAction {
+  private static final Logger LOG = 
LoggerFactory.getLogger(GracefulRollingRestartRsAction.class);
+
+  public GracefulRollingRestartRsAction(long sleepTime) {
+super(sleepTime);
+  }
+
+  @Override
+  public void perform() throws Exception {
+LOG.info("Performing action: Rolling restarting non-master region 
servers");
+List selectedServers = selectServers();
+
+LOG.info("Disabling balancer to make unloading possible");
+setBalancer(false, false);
 
 Review comment:
   This is an async call this way. Don't you need to wait for balancer to 
finish?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328061190
 
 

 ##
 File path: 
hbase-server/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java
 ##
 @@ -489,6 +499,32 @@ public String abortRegionServer(int serverNumber) {
 return server;
   }
 
+  /**
+   * Suspend the specified region server
+   * @param serverNumber Used as index into a list.
+   * @return
+   */
+  public JVMClusterUtil.RegionServerThread suspendRegionServer(int 
serverNumber) {
+JVMClusterUtil.RegionServerThread server =
+hbaseCluster.getRegionServers().get(serverNumber);
+LOG.info("Suspending " + server.toString());
 
 Review comment:
   parameterized logging


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328060329
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/RollingBatchSuspendResumeRsAction.java
 ##
 @@ -0,0 +1,116 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hbase.chaos.actions;
+
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Queue;
+
+import org.apache.commons.lang3.RandomUtils;
+import org.apache.hadoop.hbase.ServerName;
+import org.apache.hadoop.hbase.chaos.monkies.PolicyBasedChaosMonkey;
+import org.apache.hadoop.hbase.util.Threads;
+import org.apache.hadoop.util.Shell;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Suspend then resume a ratio of the regionservers in a rolling fashion. At 
each step, either
+ * suspend a server, or resume one, sleeping (sleepTime) in between steps. The 
parameter
+ * maxSuspendedServers limits the maximum number of servers that can be down 
at the same time
+ * during rolling restarts.
+ */
+public class RollingBatchSuspendResumeRsAction extends Action {
+  private static final Logger LOG =
+  LoggerFactory.getLogger(RollingBatchSuspendResumeRsAction.class);
+  private float ratio;
+  private long sleepTime;
+  private int maxSuspendedServers; // number of maximum suspended servers at 
any given time.
+
+  public RollingBatchSuspendResumeRsAction(long sleepTime, float ratio) {
+this(sleepTime, ratio, 5);
+  }
+
+  public RollingBatchSuspendResumeRsAction(long sleepTime, float ratio, int 
maxSuspendedServers) {
+this.ratio = ratio;
+this.sleepTime = sleepTime;
+this.maxSuspendedServers = maxSuspendedServers;
+  }
+
+  enum SuspendOrResume {
+SUSPEND, RESUME
+  }
+
+  @Override public void perform() throws Exception {
+LOG.info(String.format("Performing action: Rolling batch restarting %d%% 
of region servers",
+(int) (ratio * 100)));
+List selectedServers = selectServers();
+
+Queue serversToBeSuspended = new LinkedList<>(selectedServers);
+Queue suspendedServers = new LinkedList<>();
+
+// loop while there are servers to be suspended or suspended servers to be 
resumed
+while ((!serversToBeSuspended.isEmpty() || !suspendedServers.isEmpty()) && 
!context
+.isStopping()) {
+  SuspendOrResume action;
+
+  if (serversToBeSuspended.isEmpty()) { // no more servers to suspend
+action = SuspendOrResume.RESUME;
+  } else if (suspendedServers.isEmpty()) {
+action = SuspendOrResume.SUSPEND; // no more servers to resume
+  } else if (suspendedServers.size() >= maxSuspendedServers) {
+// we have too many suspended servers. Don't suspend any more
+action = SuspendOrResume.RESUME;
+  } else {
+// do a coin toss
+action = RandomUtils.nextBoolean() ? SuspendOrResume.SUSPEND : 
SuspendOrResume.RESUME;
+  }
+
+  ServerName server;
+  switch (action) {
+case SUSPEND:
+  server = serversToBeSuspended.remove();
+  try {
+suspendRs(server);
+  } catch (Shell.ExitCodeException e) {
+LOG.info("Problem suspending but presume successful; code=" + 
e.getExitCode(), e);
+  }
+  suspendedServers.add(server);
+  break;
+case RESUME:
+  server = suspendedServers.remove();
+  try {
+resumeRs(server);
+  } catch (Shell.ExitCodeException e) {
+LOG.info("Problem resuming, will retry; code=" + e.getExitCode(), 
e);
 
 Review comment:
   ditto


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r32805
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/factories/SlowDeterministicMonkeyFactory.java
 ##
 @@ -128,56 +136,65 @@ public ChaosMonkey build() {
 
   private void loadProperties() {
 
-  action1Period = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.PERIODIC_ACTION1_PERIOD,
-MonkeyConstants.DEFAULT_PERIODIC_ACTION1_PERIOD + ""));
-  action2Period = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.PERIODIC_ACTION2_PERIOD,
-MonkeyConstants.DEFAULT_PERIODIC_ACTION2_PERIOD + ""));
-  action3Period = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.COMPOSITE_ACTION3_PERIOD,
-MonkeyConstants.DEFAULT_COMPOSITE_ACTION3_PERIOD + ""));
-  action4Period = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.PERIODIC_ACTION4_PERIOD,
-MonkeyConstants.DEFAULT_PERIODIC_ACTION4_PERIOD + ""));
-  moveRegionsMaxTime = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.MOVE_REGIONS_MAX_TIME,
-MonkeyConstants.DEFAULT_MOVE_REGIONS_MAX_TIME + ""));
-  moveRegionsSleepTime = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.MOVE_REGIONS_SLEEP_TIME,
-MonkeyConstants.DEFAULT_MOVE_REGIONS_SLEEP_TIME + ""));
-  moveRandomRegionSleepTime = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.MOVE_RANDOM_REGION_SLEEP_TIME,
-MonkeyConstants.DEFAULT_MOVE_RANDOM_REGION_SLEEP_TIME + ""));
-  restartRandomRSSleepTime = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.RESTART_RANDOM_RS_SLEEP_TIME,
-MonkeyConstants.DEFAULT_RESTART_RANDOM_RS_SLEEP_TIME + ""));
-  batchRestartRSSleepTime = Long.parseLong(this.properties.getProperty(
-MonkeyConstants.BATCH_RESTART_RS_SLEEP_TIME,
-MonkeyConstants.DEFAULT_BATCH_RESTART_RS_SLEEP_TIME + ""));
-  batchRestartRSRatio = Float.parseFloat(this.properties.getProperty(
-MonkeyConstants.BATCH_RESTART_RS_RATIO,
-MonkeyConstants.DEFAULT_BATCH_RESTART_RS_RATIO + ""));
-  restartActiveMasterSleepTime = 
Long.parseLong(this.properties.getProperty(
-MonkeyConstants.RESTART_ACTIVE_MASTER_SLEEP_TIME,
-MonkeyConstants.DEFAULT_RESTART_ACTIVE_MASTER_SLEEP_TIME + ""));
-  rollingBatchRestartRSSleepTime = 
Long.parseLong(this.properties.getProperty(
+action1Period = Long.parseLong(this.properties.getProperty(
 
 Review comment:
   Please do not reformat unrelated lines.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hbase] petersomogyi commented on a change in pull request #592: HBASE-22982: region server suspend/resume and graceful rolling restart actions

2019-09-25 Thread GitBox
petersomogyi commented on a change in pull request #592: HBASE-22982: region 
server suspend/resume and graceful rolling restart actions
URL: https://github.com/apache/hbase/pull/592#discussion_r328015807
 
 

 ##
 File path: 
hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/Action.java
 ##
 @@ -150,78 +151,106 @@ public void perform() throws Exception { }
   }
 
   protected void killMaster(ServerName server) throws IOException {
-LOG.info("Killing master " + server);
+LOG.info("Killing master {}", server);
 cluster.killMaster(server);
 cluster.waitForMasterToStop(server, killMasterTimeout);
 LOG.info("Killed master " + server);
   }
 
   protected void startMaster(ServerName server) throws IOException {
-LOG.info("Starting master " + server.getHostname());
+LOG.info("Starting master {}", server.getHostname());
 cluster.startMaster(server.getHostname(), server.getPort());
 cluster.waitForActiveAndReadyMaster(startMasterTimeout);
 LOG.info("Started master " + server.getHostname());
   }
 
+  protected void stopRs(ServerName server) throws IOException {
+LOG.info("Stopping regionserver {}", server);
+cluster.stopRegionServer(server);
+cluster.waitForRegionServerToStop(server, killRsTimeout);
+LOG.info("Stoppiong regionserver {}. Reported num of rs:{}", server,
+cluster.getClusterMetrics().getLiveServerMetrics().size());
+  }
+
+  protected void suspendRs(ServerName server) throws IOException {
+LOG.info("Suspending regionserver {}", server);
+cluster.suspendRegionServer(server);
+if(!(cluster instanceof MiniHBaseCluster)){
 
 Review comment:
   What happens on non-minicluster?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services