[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-14 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r313780023
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -265,6 +269,13 @@ public void persistContainerSet(OutputStream out) throws IOException {
   public long takeSnapshot() throws IOException {
 TermIndex ti = getLastAppliedTermIndex();
 long startTime = Time.monotonicNow();
+if (!isStateMachineHealthy.get()) {
+  String msg =
+  "Failed to take snapshot " + " for " + gid + " as the stateMachine"
+  + " is unhealthy. The last applied index is at " + ti;
 
 Review comment:
   Addressed in the latest patch.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-14 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r313780014
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +681,60 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommand(requestProto, builder.build()),
   getCommandExecutor(requestProto));
-
-  future.thenAccept(m -> {
+  future.thenApply(r -> {
 if (trx.getServerRole() == RaftPeerRole.LEADER) {
   long startTime = (long) trx.getStateMachineContext();
   metrics.incPipelineLatency(cmdType,
   Time.monotonicNowNanos() - startTime);
 }
-
-final Long previous =
-applyTransactionCompletionMap
-.put(index, trx.getLogEntry().getTerm());
-Preconditions.checkState(previous == null);
-if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
-  metrics.incNumBytesCommittedCount(
+if (r.getResult() != ContainerProtos.Result.SUCCESS) {
+  StorageContainerException sce =
+  new StorageContainerException(r.getMessage(), r.getResult());
+  LOG.error(
+  "gid {} : ApplyTransaction failed. cmd {} logIndex {} msg : "
+  + "{} Container Result: {}", gid, r.getCmdType(), index,
+  r.getMessage(), r.getResult());
+  metrics.incNumApplyTransactionsFails();
+  ratisServer.handleApplyTransactionFailure(gid, trx.getServerRole());
+  // Since the applyTransaction now is completed exceptionally,
+  // before any further snapshot is taken , the exception will be
+  // caught in stateMachineUpdater in Ratis and ratis server will
+  // shutdown.
+  applyTransactionFuture.completeExceptionally(sce);
 
 Review comment:
   Addressed in the latest patch.
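
   For readers following the thread, here is a minimal sketch of the pattern in
the hunk above: the command runs on a separate executor, and a non-success
result completes a second future exceptionally so that whoever waits on it sees
the failure. The class, enum, and method names are hypothetical stand-ins, not
the ContainerStateMachine or Ratis APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ApplyFutureSketch {
  // Hypothetical stand-in for the container result code; not the real proto.
  enum Result { SUCCESS, CONTAINER_UNHEALTHY }

  // Runs the "command" on the given executor and maps a non-success result
  // to an exceptionally completed future, mirroring the shape of the patch.
  static CompletableFuture<String> apply(Result outcome, ExecutorService ex) {
    CompletableFuture<String> applyFuture = new CompletableFuture<>();
    CompletableFuture.supplyAsync(() -> outcome, ex)
        .thenApply(r -> {
          if (r != Result.SUCCESS) {
            // The failure is surfaced to the caller instead of being dropped.
            applyFuture.completeExceptionally(
                new IllegalStateException("apply failed: " + r));
          } else {
            applyFuture.complete("applied");
          }
          return applyFuture;
        });
    return applyFuture;
  }

  // Returns true when waiting on the future surfaces the failure.
  static boolean failsWith(Result outcome) throws InterruptedException {
    ExecutorService ex = Executors.newSingleThreadExecutor();
    try {
      apply(outcome, ex).get();
      return false;
    } catch (ExecutionException e) {
      return e.getCause() instanceof IllegalStateException;
    } finally {
      ex.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(failsWith(Result.SUCCESS));
    System.out.println(failsWith(Result.CONTAINER_UNHEALTHY));
  }
}
```

   Note that this sketch does not handle the case where the supplier itself
throws; in that case the outer future would never complete, so real code would
also need an exceptionally() or whenComplete() branch.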





[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-13 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r313234077
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -270,4 +279,73 @@ public void testUnhealthyContainer() throws Exception {
 Assert.assertEquals(ContainerProtos.Result.CONTAINER_UNHEALTHY,
 dispatcher.dispatch(request.build(), null).getResult());
   }
+
+  @Test
+  public void testAppyTransactionFailure() throws Exception {
+OzoneOutputStream key =
+objectStore.getVolume(volumeName).getBucket(bucketName)
+.createKey("ratis", 1024, ReplicationType.RATIS,
+ReplicationFactor.ONE, new HashMap<>());
+// First write and flush creates a container in the datanode
+key.write("ratis".getBytes());
+key.flush();
+key.write("ratis".getBytes());
+
+//get the name of a valid container
+OmKeyArgs keyArgs = new OmKeyArgs.Builder().setVolumeName(volumeName).
+setBucketName(bucketName).setType(HddsProtos.ReplicationType.RATIS)
+.setFactor(HddsProtos.ReplicationFactor.ONE).setKeyName("ratis")
+.build();
+KeyOutputStream groupOutputStream = (KeyOutputStream) key.getOutputStream();
+List locationInfoList =
+groupOutputStream.getLocationInfoList();
+Assert.assertEquals(1, locationInfoList.size());
+OmKeyLocationInfo omKeyLocationInfo = locationInfoList.get(0);
+ContainerData containerData =
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getContainerSet()
+.getContainer(omKeyLocationInfo.getContainerID())
+.getContainerData();
+Assert.assertTrue(containerData instanceof KeyValueContainerData);
+KeyValueContainerData keyValueContainerData =
+(KeyValueContainerData) containerData;
+key.close();
+
+long containerID = omKeyLocationInfo.getContainerID();
+// delete the container db file
+FileUtil.fullyDelete(new File(keyValueContainerData.getContainerPath()));
+Pipeline pipeline = cluster.getStorageContainerLocationClient()
+.getContainerWithPipeline(containerID).getPipeline();
+XceiverClientSpi client = xceiverClientManager.acquireClient(pipeline);
+ContainerProtos.ContainerCommandRequestProto.Builder request =
 
 Review comment:
   The idea is to execute a transaction on the same container. If we write more 
data, it can potentially go to a new container altogether.





[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311346307
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
 ##
 @@ -609,6 +609,16 @@ void handleNoLeader(RaftGroupId groupId, RoleInfoProto roleInfoProto) {
 handlePipelineFailure(groupId, roleInfoProto);
   }
 
+  void handleApplyTransactionFailure(RaftGroupId groupId,
+  RaftProtos.RaftPeerRole role) {
+UUID dnId = RatisHelper.toDatanodeId(getServer().getId());
+String msg =
+"Ratis Transaction failure in datanode" + dnId + " with role " + role
++ " Triggering pipeline close action.";
+triggerPipelineClose(groupId, msg, ClosePipelineInfo.Reason.PIPELINE_FAILED,
+false);
+stop();
 
 Review comment:
   As far as I know from previous discussions, the decision was to not take 
any other transactions on this pipeline at all and kill the RaftServerImpl 
instance. Any deviation from that conclusion?
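
   To illustrate that decision, here is a minimal sketch of a latching health
flag: the first failed transaction marks the state machine unhealthy, after
which both further transactions and snapshots are refused. It is only a sketch
under assumptions; the class and method names are hypothetical, and the real
patch stops the Ratis server rather than returning false.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class HealthGateSketch {
  // Latches to false on the first failure and never resets.
  private final AtomicBoolean healthy = new AtomicBoolean(true);
  private long lastApplied = -1;

  // Applies one transaction; returns false if it failed or was refused.
  boolean apply(long index, boolean succeeds) {
    if (!healthy.get()) {
      return false; // refuse all work after the first failure
    }
    if (!succeeds) {
      healthy.set(false);
      return false;
    }
    lastApplied = index;
    return true;
  }

  // Refuses to snapshot once unhealthy, so a snapshot cannot move past
  // the failed entry (an assumption about why the guard exists).
  long takeSnapshot() {
    if (!healthy.get()) {
      throw new IllegalStateException(
          "state machine is unhealthy; last applied index " + lastApplied);
    }
    return lastApplied;
  }

  public static void main(String[] args) {
    HealthGateSketch gate = new HealthGateSketch();
    System.out.println(gate.apply(1, true));
    System.out.println(gate.apply(2, false));
    System.out.println(gate.apply(3, true));
  }
}
```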





[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311072758
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -270,4 +279,73 @@ public void testUnhealthyContainer() throws Exception {
 Assert.assertEquals(ContainerProtos.Result.CONTAINER_UNHEALTHY,
 dispatcher.dispatch(request.build(), null).getResult());
   }
+
+  @Test
+  public void testAppyTransactionFailure() throws Exception {
+OzoneOutputStream key =
+objectStore.getVolume(volumeName).getBucket(bucketName)
+.createKey("ratis", 1024, ReplicationType.RATIS,
+ReplicationFactor.ONE, new HashMap<>());
+// First write and flush creates a container in the datanode
+key.write("ratis".getBytes());
+key.flush();
+key.write("ratis".getBytes());
+
+//get the name of a valid container
+OmKeyArgs keyArgs = new OmKeyArgs.Builder().setVolumeName(volumeName).
+setBucketName(bucketName).setType(HddsProtos.ReplicationType.RATIS)
+.setFactor(HddsProtos.ReplicationFactor.ONE).setKeyName("ratis")
+.build();
+KeyOutputStream groupOutputStream = (KeyOutputStream) key.getOutputStream();
+List locationInfoList =
+groupOutputStream.getLocationInfoList();
+Assert.assertEquals(1, locationInfoList.size());
+OmKeyLocationInfo omKeyLocationInfo = locationInfoList.get(0);
+ContainerData containerData =
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getContainerSet()
+.getContainer(omKeyLocationInfo.getContainerID())
+.getContainerData();
+Assert.assertTrue(containerData instanceof KeyValueContainerData);
+KeyValueContainerData keyValueContainerData =
+(KeyValueContainerData) containerData;
+key.close();
+
+long containerID = omKeyLocationInfo.getContainerID();
+// delete the container db file
+FileUtil.fullyDelete(new File(keyValueContainerData.getContainerPath()));
+Pipeline pipeline = cluster.getStorageContainerLocationClient()
+.getContainerWithPipeline(containerID).getPipeline();
+XceiverClientSpi client = xceiverClientManager.acquireClient(pipeline);
+ContainerProtos.ContainerCommandRequestProto.Builder request =
+ContainerProtos.ContainerCommandRequestProto.newBuilder();
+request.setDatanodeUuid(pipeline.getFirstNode().getUuidString());
+request.setCmdType(ContainerProtos.Type.CloseContainer);
+request.setContainerID(containerID);
+request.setCloseContainer(
+ContainerProtos.CloseContainerRequestProto.getDefaultInstance());
+// close container transaction will fail over Ratis and will cause the raft
+try {
+  client.sendCommand(request.build());
+  Assert.fail("Expected exception not thrown");
+} catch (IOException e) {
+}
+
+// Make sure the container is marked unhealthy
+Assert.assertTrue(
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getContainerSet().getContainer(containerID)
+.getContainerState()
+== ContainerProtos.ContainerDataProto.State.UNHEALTHY);
+XceiverServerRatis raftServer = (XceiverServerRatis)
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getWriteChannel();
+Assert.assertTrue(raftServer.isClosed());
 
 Review comment:
   Will address in the next patch.





[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311072621
 
 

 ##
 File path: 
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java
 ##
 @@ -270,4 +279,73 @@ public void testUnhealthyContainer() throws Exception {
 Assert.assertEquals(ContainerProtos.Result.CONTAINER_UNHEALTHY,
 dispatcher.dispatch(request.build(), null).getResult());
   }
+
+  @Test
+  public void testAppyTransactionFailure() throws Exception {
+OzoneOutputStream key =
+objectStore.getVolume(volumeName).getBucket(bucketName)
+.createKey("ratis", 1024, ReplicationType.RATIS,
+ReplicationFactor.ONE, new HashMap<>());
+// First write and flush creates a container in the datanode
+key.write("ratis".getBytes());
+key.flush();
+key.write("ratis".getBytes());
+
+//get the name of a valid container
+OmKeyArgs keyArgs = new OmKeyArgs.Builder().setVolumeName(volumeName).
+setBucketName(bucketName).setType(HddsProtos.ReplicationType.RATIS)
+.setFactor(HddsProtos.ReplicationFactor.ONE).setKeyName("ratis")
+.build();
+KeyOutputStream groupOutputStream = (KeyOutputStream) key.getOutputStream();
+List locationInfoList =
+groupOutputStream.getLocationInfoList();
+Assert.assertEquals(1, locationInfoList.size());
+OmKeyLocationInfo omKeyLocationInfo = locationInfoList.get(0);
+ContainerData containerData =
+cluster.getHddsDatanodes().get(0).getDatanodeStateMachine()
+.getContainer().getContainerSet()
+.getContainer(omKeyLocationInfo.getContainerID())
+.getContainerData();
+Assert.assertTrue(containerData instanceof KeyValueContainerData);
+KeyValueContainerData keyValueContainerData =
+(KeyValueContainerData) containerData;
+key.close();
+
+long containerID = omKeyLocationInfo.getContainerID();
+// delete the container db file
+FileUtil.fullyDelete(new File(keyValueContainerData.getContainerPath()));
+Pipeline pipeline = cluster.getStorageContainerLocationClient()
+.getContainerWithPipeline(containerID).getPipeline();
+XceiverClientSpi client = xceiverClientManager.acquireClient(pipeline);
+ContainerProtos.ContainerCommandRequestProto.Builder request =
+ContainerProtos.ContainerCommandRequestProto.newBuilder();
+request.setDatanodeUuid(pipeline.getFirstNode().getUuidString());
+request.setCmdType(ContainerProtos.Type.CloseContainer);
+request.setContainerID(containerID);
+request.setCloseContainer(
+ContainerProtos.CloseContainerRequestProto.getDefaultInstance());
+// close container transaction will fail over Ratis and will cause the raft
+try {
+  client.sendCommand(request.build());
+  Assert.fail("Expected exception not thrown");
+} catch (IOException e) {
+}
 
 Review comment:
   Will address in the next patch.





[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311072353
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java
 ##
 @@ -609,6 +609,16 @@ void handleNoLeader(RaftGroupId groupId, RoleInfoProto roleInfoProto) {
 handlePipelineFailure(groupId, roleInfoProto);
   }
 
+  void handleApplyTransactionFailure(RaftGroupId groupId,
+  RaftProtos.RaftPeerRole role) {
+UUID dnId = RatisHelper.toDatanodeId(getServer().getId());
+String msg =
+"Ratis Transaction failure in datanode" + dnId + " with role " + role
++ " Triggering pipeline close action.";
+triggerPipelineClose(groupId, msg, ClosePipelineInfo.Reason.PIPELINE_FAILED,
 
 Review comment:
   I think the msg will differentiate what caused the error. The reason code 
is just for SCM to take the action of closing the pipeline. I don't think SCM 
needs to differentiate its behaviour depending on why the pipeline failed.
   
   If required, we can add it in a separate jira, since it would also need to 
change for the other causes of pipeline failure.
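
   The split described here, a coarse reason code that drives the receiver's
action plus a free-form message that records the cause, can be sketched as
follows. The types and method names are hypothetical illustrations, not the
ClosePipelineInfo or SCM APIs; only the PIPELINE_FAILED name comes from the
diff above.

```java
public class CloseReasonSketch {
  // Hypothetical reason codes; only PIPELINE_FAILED appears in the diff.
  enum Reason { PIPELINE_FAILED }

  // Hypothetical close action: a coarse reason for the receiver to act on,
  // and a detailed message intended for logs and operators.
  static final class CloseAction {
    final Reason reason;
    final String detailedMessage;

    CloseAction(Reason reason, String detailedMessage) {
      this.reason = reason;
      this.detailedMessage = detailedMessage;
    }
  }

  static CloseAction forApplyFailure(String dnId, String role) {
    return new CloseAction(Reason.PIPELINE_FAILED,
        "Ratis transaction failure in datanode " + dnId + " with role " + role);
  }

  static CloseAction forNoLeader(String dnId) {
    return new CloseAction(Reason.PIPELINE_FAILED,
        "No leader seen by datanode " + dnId);
  }

  // The receiver decides on the reason alone; the message is informational.
  static String decide(CloseAction action) {
    switch (action.reason) {
      case PIPELINE_FAILED:
        return "CLOSE_PIPELINE";
      default:
        return "IGNORE";
    }
  }

  public static void main(String[] args) {
    CloseAction a = forApplyFailure("dn-1", "LEADER");
    CloseAction b = forNoLeader("dn-1");
    System.out.println(decide(a) + " <- " + a.detailedMessage);
    System.out.println(decide(b) + " <- " + b.detailedMessage);
  }
}
```

   Both causes lead to the same action, while the messages stay distinct for
anyone reading the logs.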





[GitHub] [hadoop] bshashikant commented on a change in pull request #1226: HDDS-1610. applyTransaction failure should not be lost on restart.

2019-08-06 Thread GitBox
bshashikant commented on a change in pull request #1226: HDDS-1610. 
applyTransaction failure should not be lost on restart.
URL: https://github.com/apache/hadoop/pull/1226#discussion_r311070601
 
 

 ##
 File path: 
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java
 ##
 @@ -674,30 +674,54 @@ public void notifyIndexUpdate(long term, long index) {
   if (cmdType == Type.WriteChunk || cmdType ==Type.PutSmallFile) {
 builder.setCreateContainerSet(createContainerSet);
   }
+  CompletableFuture applyTransactionFuture =
+  new CompletableFuture<>();
   // Ensure the command gets executed in a separate thread than
   // stateMachineUpdater thread which is calling applyTransaction here.
-  CompletableFuture future = CompletableFuture
-  .supplyAsync(() -> runCommand(requestProto, builder.build()),
+  CompletableFuture future =
+  CompletableFuture.supplyAsync(
+  () -> runCommandGetResponse(requestProto, builder.build()),
   getCommandExecutor(requestProto));
-
-  future.thenAccept(m -> {
+  future.thenApply(r -> {
 if (trx.getServerRole() == RaftPeerRole.LEADER) {
   long startTime = (long) trx.getStateMachineContext();
   metrics.incPipelineLatency(cmdType,
   Time.monotonicNowNanos() - startTime);
 }
-
-final Long previous =
-applyTransactionCompletionMap
-.put(index, trx.getLogEntry().getTerm());
-Preconditions.checkState(previous == null);
-if (cmdType == Type.WriteChunk || cmdType == Type.PutSmallFile) {
-  metrics.incNumBytesCommittedCount(
+if (r.getResult() != ContainerProtos.Result.SUCCESS) {
+  StorageContainerException sce =
+  new StorageContainerException(r.getMessage(), r.getResult());
+  LOG.error(gid + ": ApplyTransaction failed: cmd " + r.getCmdType()
 
 Review comment:
   The container ID will be present in the response message. Will add that to 
the logger output.

