[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-06 Thread Jimmy Xiang (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-5081:
---

Attachment: distributed_log_splitting_screenshot3.png

After restart, it still doesn't work.  See the attached 3rd screen shot.  

Probably we should commit this one and open a new Jira.

@Stack,  to reproduce it, you can set these properties and run bigtop 
TestLoadAndVerify: 

  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1048576</value>
  </property>
  <property>
    <name>hbase.master.distributed.log.splitting</name>
    <value>true</value>
  </property>

  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <description>Hadoop setting</description>
  </property>
  <property>
    <name>hbase.balancer.period</name>
    <value>2000</value>
    <description>Period at which the region balancer runs in the Master.</description>
  </property>
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>262144</value> <!-- 256KB -->
  </property>


 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 5081-deleteNode-with-while-loop.txt, 
 HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, 
 distributed_log_splitting_screen_shot2.png, 
 distributed_log_splitting_screenshot3.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


 Recently, during 0.92 rc testing, we found that distributed log splitting 
 can hang forever.  Please see the attached screenshot.
 I looked into it and here is what I think happened:
 1. One region server (rs) died; the ServerShutdownHandler detected this and 
 started distributed log splitting.
 2. All three tasks failed, so the three tasks were deleted asynchronously.
 3. ServerShutdownHandler retried the log splitting.
 4. During the retry, it created these three tasks again and put them in a 
 hashmap (tasks).
 5. The asynchronous deletion from step 2 finally happened for one task; its 
 callback removed that task from the hashmap.
 6. The ZooKeeper watcher for one of the newly submitted tasks found that the 
 task was unassigned and not in the hashmap, so it created a new orphan task.
 7. All three tasks failed, but the task created in step 6 was an orphan, so 
 the batch.err counter was one short.  The log splitting therefore hangs, 
 waiting for a last task that will never finish.
 So I think the problem is step 2.  The fix is to make deletion synchronous 
 instead of asynchronous, so that the retry has a clean start.
 An async deleteNode interferes with the split log retry.  In an extreme case, 
 if the async deleteNode does not complete soon enough, a node created during 
 the retry could be deleted.
 deleteNode should be sync.
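
The sequence in steps 1 to 7 can be replayed deterministically in a few lines of Java. This is only a sketch of the reasoning in the description, not HBase code: the class, the Task type, the batch ids, and the log paths are all hypothetical, and the late delete callback is modelled as a queued Runnable rather than a real ZooKeeper callback.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class DeleteNodeRaceSketch {

    /** Hypothetical task record; batchId is null for an orphan task. */
    static final class Task {
        final Integer batchId;
        Task(Integer batchId) { this.batchId = batchId; }
    }

    /** Returns how many failures the retry batch gets to count (3 are expected). */
    static int errorsCountedForRetry(boolean syncDelete) {
        String[] paths = {"log-1", "log-2", "log-3"};
        Map<String, Task> tasks = new HashMap<>();
        Queue<Runnable> pendingDeletes = new ArrayDeque<>();

        // Steps 1-2: the first attempt installs three tasks, all fail, all get deleted.
        for (String path : paths) {
            tasks.put(path, new Task(1));
        }
        for (String path : paths) {
            Runnable deleteCallback = () -> tasks.remove(path);
            if (syncDelete) {
                deleteCallback.run();               // the map is clean before the retry
            } else {
                pendingDeletes.add(deleteCallback); // the callback will fire later
            }
        }

        // Steps 3-4: the retry installs the same three paths under a new batch.
        for (String path : paths) {
            tasks.put(path, new Task(2));
        }

        // Step 5: one stale callback from the first attempt finally runs.
        Runnable stale = pendingDeletes.poll();
        if (stale != null) {
            stale.run();
        }

        // Step 6: the watcher sees a path that is not in the map and adopts it as an orphan.
        for (String path : paths) {
            tasks.putIfAbsent(path, new Task(null));
        }

        // Step 7: every task fails, but only tasks owned by the retry batch are counted.
        int batchErr = 0;
        for (Task task : tasks.values()) {
            if (task.batchId != null && task.batchId == 2) {
                batchErr++;
            }
        }
        return batchErr;
    }

    public static void main(String[] args) {
        System.out.println("async delete: " + errorsCountedForRetry(false) + " of 3 counted");
        System.out.println("sync delete:  " + errorsCountedForRetry(true) + " of 3 counted");
    }
}
```

With the asynchronous variant the retry batch counts one failure fewer than the number of tasks it installed, which corresponds to the hang described in step 7; with the synchronous variant the counts match.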

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-06 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Will open a new issue if another distributed log splitting bug is discovered.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-05 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

Added a test on top of Ted's last change.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-05 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Attachment: HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

The new test Prakash added passed.
I found an unrelated change in pom.xml.

This patch removes that change.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-05 Thread Jimmy Xiang (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang updated HBASE-5081:
---

Attachment: distributed_log_splitting_screen_shot2.png

With the latest patch, I got something funny again.  Please see the screen shot 
attached.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-05 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Comment: was deleted

(was: -1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12509352/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/657//console

This message is automatically generated.)





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-05 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Comment: was deleted

(was: -1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12509653/distributed_log_splitting_screen_shot2.png
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.  Please justify why no new tests are needed for this patch.  Also 
please list what manual steps were performed to verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/682//console

This message is automatically generated.)





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-04 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

Implemented review feedback.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-04 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

Set the interrupt flag on InterruptedException. Removed OrphanLogException 
handling for distributed log splitting.
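
For readers unfamiliar with the idiom, setting the interrupt flag on InterruptedException usually looks like the sketch below. It is not taken from the patch: the class name, the method name, and the use of a CountDownLatch are assumptions chosen only to keep the example self-contained.

```java
import java.util.concurrent.CountDownLatch;

final class InterruptFlagSketch {

    /**
     * Waits for the latch. Returns false if the wait was interrupted, and in
     * that case re-sets the thread's interrupt flag so callers can still see it.
     */
    static boolean awaitQuietly(CountDownLatch done) {
        try {
            done.await();
            return true;
        } catch (InterruptedException e) {
            // await() cleared the flag when it threw; restore it for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Restoring the flag matters because catching InterruptedException clears it; code further up the stack that checks Thread.currentThread().isInterrupted() would otherwise never learn that a shutdown was requested.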





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-04 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Attachment: 5081-deleteNode-with-while-loop.txt

Patch based on Prakash's latest patch.
Changed 'if (oldtask.status == FAILURE)' to a while loop.
Also restored the @param for the first SplitLogManager constructor.
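
One common reason to replace such an 'if' with a 'while' is the guarded-wait idiom: a thread woken from Object.wait() should re-check the condition, because wakeups can be spurious or caused by an unrelated notify. The sketch below illustrates that idiom only. It assumes, without confirmation from the patch, that the loop waits on the old task; the DELETED status, the Task type, and both method names are hypothetical.

```java
final class GuardedWaitSketch {

    enum Status { FAILURE, DELETED }

    /** Hypothetical task whose monitor guards its status field. */
    static final class Task {
        Status status = Status.FAILURE;
    }

    /** Blocks while the task is still in FAILURE and returns the status seen afterwards. */
    static Status waitWhileFailed(Task oldtask) {
        synchronized (oldtask) {
            // 'while' rather than 'if': re-check the condition after every wakeup.
            while (oldtask.status == Status.FAILURE) {
                try {
                    oldtask.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt visible
                    break;
                }
            }
            return oldtask.status;
        }
    }

    /** Called once the old node is gone; wakes up any thread waiting on the task. */
    static void markDeleted(Task task) {
        synchronized (task) {
            task.status = Status.DELETED;
            task.notifyAll();
        }
    }
}
```

With an 'if', a waiter that woke up early would continue even though the task was still in FAILURE; the loop sends it back to wait until the status has actually changed or the thread is interrupted.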

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 5081-deleteNode-with-while-loop.txt, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


 Recently, during 0.92 RC testing, we found that distributed log splitting 
 hangs forever.  Please see the attached screenshot.
 I looked into it, and here is what I think happened:
 1. One region server died; the ServerShutdownHandler detected it and started 
 distributed log splitting.
 2. All three tasks failed, so the three tasks were deleted asynchronously.
 3. The ServerShutdownHandler retried the log splitting.
 4. During the retry, it created the three tasks again and put them in a 
 hashmap (tasks).
 5. The asynchronous deletion from step 2 finally happened for one task; in 
 the callback, it removed that task from the hashmap.
 6. The ZooKeeper watcher for one of the newly submitted tasks found that the 
 task was unassigned and not in the hashmap, so it created a new orphan task.
 7. All three tasks failed, but the task created in step 6 was an orphan, so 
 the batch.err counter was one short.  The log splitting therefore hangs, 
 waiting for a last task that will never finish.
 So I think the problem is step 2.  The fix is to make the deletion 
 synchronous instead of asynchronous, so that the retry has a clean start.
 Async deleteNode interferes with the split log retry.  In an extreme case, 
 if the async deleteNode does not happen soon enough, a node created during 
 the retry could be deleted.
 deleteNode should be sync.
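
 The interleaving in steps 2 to 6 can be replayed deterministically. The 
 sketch below is a single-threaded model, not HBase code: the class, the 
 Task type, the znode path, and the method name are all hypothetical. It only 
 shows that a delete callback which runs after the retry has re-registered 
 the same path removes the retry's entry, while a deletion that completes 
 before the retry does not.

```java
import java.util.HashMap;
import java.util.Map;

/** Deterministic model of the reported race; all names are hypothetical. */
public class DeleteRaceSketch {
    /** Stand-in for a split-log task tracked by the master. */
    static final class Task {
        final int attempt;
        Task(int attempt) { this.attempt = attempt; }
    }

    /** Stand-in for the in-memory task map, keyed by znode path. */
    static final Map<String, Task> tasks = new HashMap<>();

    /**
     * Replays the sequence for one task. With syncDelete == false the delete
     * callback fires after the retry has registered the path again.
     * Returns true if the retry's task is still tracked at the end.
     */
    public static boolean retryTaskSurvives(boolean syncDelete) {
        tasks.clear();
        String path = "/hbase/splitlog/wal-1";         // hypothetical znode path
        tasks.put(path, new Task(1));                   // first attempt, which fails

        Runnable deleteCallback = () -> tasks.remove(path);  // cleanup from step 2
        if (syncDelete) {
            deleteCallback.run();                       // deletion finishes before the retry
        }
        tasks.put(path, new Task(2));                   // step 4: retry registers the task
        if (!syncDelete) {
            deleteCallback.run();                       // step 5: stale callback fires late
        }
        Task tracked = tasks.get(path);
        return tracked != null && tracked.attempt == 2;
    }

    public static void main(String[] args) {
        System.out.println("async delete, retry task tracked: " + retryTaskSurvives(false));
        System.out.println("sync delete, retry task tracked: " + retryTaskSurvives(true));
    }
}
```

 In the async case the retry's task is no longer in the map, which matches 
 the condition in step 6 under which the watcher treats the node as an orphan.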





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-04 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Attachment: (was: 5081-deleteNode-with-while-loop.txt)

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt







[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-04 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Attachment: 5081-deleteNode-with-while-loop.txt

Patch with break statement in catch block.

TestDistributedLogSplitting and TestSplitLogManager pass with this patch.

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 5081-deleteNode-with-while-loop.txt, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt







[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-03 Thread Jean-Daniel Cryans (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated HBASE-5081:
--

Summary: Distributed log splitting deleteNode races against splitLog retry  
 (was: Distributed log splitting deleteNode races againsth splitLog retry )

Fixing the typo in the title.

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt







[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-03 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

I rebased and it rebased cleanly for me. Anyway, uploading the format-patch 
output again to see if it applies. 

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt







[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-03 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5081:
--

Comment: was deleted

(was: -1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12509344/0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 8 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/656//console

This message is automatically generated.)

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt







[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-03 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

Fixed an (unrelated) test failure in TestSplitLogManager.testRescanCleanup().

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


