[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits

2012-04-04 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5618:
---

Status: Open  (was: Patch Available)

 SplitLogManager - prevent unnecessary attempts to resubmits
 ---

 Key: HBASE-5618
 URL: https://issues.apache.org/jira/browse/HBASE-5618
 Project: HBase
  Issue Type: Improvement
  Components: wal, zookeeper
Reporter: Prakash Khemani
Assignee: Prakash Khemani
 Attachments: 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch


 Currently, once a watch fires indicating that the task node has been updated 
 (heartbeated) by the worker, the SplitLogManager still takes quite some time 
 before it updates the last-heard-from time. This is because the manager 
 schedules another getDataSetWatch() and updates the task's last-heard-from 
 time only after that call finishes.
 This leads to a large number of zk-BadVersion warnings when resubmission is 
 attempted repeatedly and keeps failing.
 Two changes should be made:
 (1) On a resubmission failure caused by BadVersion, the task's lastUpdate 
 time should be advanced.
 (2) The task's lastUpdate time should be advanced as soon as the 
 nodeDataChanged() watch fires, without waiting for getDataSetWatch() to 
 complete.
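
 The two changes proposed above can be illustrated with a minimal Java sketch. 
 This is not the attached patch: HeartbeatSketch, Task, tryResubmit() and the 
 injected clock are hypothetical names, and the versionInZk argument merely 
 stands in for the outcome of a conditional ZooKeeper write (a mismatch plays 
 the role of BadVersion).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical, simplified model of per-task bookkeeping; not the HBase class. */
class HeartbeatSketch {
    static final class Task {
        volatile long lastUpdate;  // last time the worker was heard from
        volatile int lastVersion;  // znode version the manager last observed
    }

    private final Map<String, Task> tasks = new ConcurrentHashMap<>();
    private final LongSupplier clock;  // injectable so the sketch is testable
    private final long timeoutMs;

    HeartbeatSketch(LongSupplier clock, long timeoutMs) {
        this.clock = clock;
        this.timeoutMs = timeoutMs;
    }

    void register(String path, int version) {
        Task t = new Task();
        t.lastUpdate = clock.getAsLong();
        t.lastVersion = version;
        tasks.put(path, t);
    }

    boolean isTimedOut(String path) {
        Task t = tasks.get(path);
        return t != null && clock.getAsLong() - t.lastUpdate > timeoutMs;
    }

    /** Change (2): treat the watch event itself as a heartbeat. */
    void onNodeDataChanged(String path) {
        Task t = tasks.get(path);
        if (t != null) {
            t.lastUpdate = clock.getAsLong();
        }
        // A real manager would now schedule the asynchronous re-read; omitted here.
    }

    /** Change (1): a version mismatch (the BadVersion case) also counts as a heartbeat. */
    boolean tryResubmit(String path, int versionInZk) {
        Task t = tasks.get(path);
        if (t == null || !isTimedOut(path)) {
            return false;
        }
        if (versionInZk != t.lastVersion) {
            // The worker touched the node after our last read, so it is alive.
            t.lastUpdate = clock.getAsLong();
            return false;
        }
        t.lastVersion = versionInZk + 1;  // our own write bumps the version
        t.lastUpdate = clock.getAsLong();
        return true;
    }
}
```

 In this model, a failed resubmission pushes the next attempt out by a full 
 timeout interval instead of being retried on every timer tick, which is the 
 behavior the description asks for.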

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits

2012-04-04 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5618:
---

Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch

re-attaching the same patch. I had cancelled it by mistake.

 SplitLogManager - prevent unnecessary attempts to resubmits
 ---

 Key: HBASE-5618
 URL: https://issues.apache.org/jira/browse/HBASE-5618
 Project: HBase
  Issue Type: Improvement
  Components: wal, zookeeper
Reporter: Prakash Khemani
Assignee: Prakash Khemani
 Attachments: 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch


 Currently, once a watch fires indicating that the task node has been updated 
 (heartbeated) by the worker, the SplitLogManager still takes quite some time 
 before it updates the last-heard-from time. This is because the manager 
 schedules another getDataSetWatch() and updates the task's last-heard-from 
 time only after that call finishes.
 This leads to a large number of zk-BadVersion warnings when resubmission is 
 attempted repeatedly and keeps failing.
 Two changes should be made:
 (1) On a resubmission failure caused by BadVersion, the task's lastUpdate 
 time should be advanced.
 (2) The task's lastUpdate time should be advanced as soon as the 
 nodeDataChanged() watch fires, without waiting for getDataSetWatch() to 
 complete.





[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits

2012-04-04 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5618:
---

Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch

 SplitLogManager - prevent unnecessary attempts to resubmits
 ---

 Key: HBASE-5618
 URL: https://issues.apache.org/jira/browse/HBASE-5618
 Project: HBase
  Issue Type: Improvement
  Components: wal, zookeeper
Reporter: Prakash Khemani
Assignee: Prakash Khemani
 Attachments: 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch


 Currently, once a watch fires indicating that the task node has been updated 
 (heartbeated) by the worker, the SplitLogManager still takes quite some time 
 before it updates the last-heard-from time. This is because the manager 
 schedules another getDataSetWatch() and updates the task's last-heard-from 
 time only after that call finishes.
 This leads to a large number of zk-BadVersion warnings when resubmission is 
 attempted repeatedly and keeps failing.
 Two changes should be made:
 (1) On a resubmission failure caused by BadVersion, the task's lastUpdate 
 time should be advanced.
 (2) The task's lastUpdate time should be advanced as soon as the 
 nodeDataChanged() watch fires, without waiting for getDataSetWatch() to 
 complete.





[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits

2012-04-03 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5618:
---

Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch

 SplitLogManager - prevent unnecessary attempts to resubmits
 ---

 Key: HBASE-5618
 URL: https://issues.apache.org/jira/browse/HBASE-5618
 Project: HBase
  Issue Type: Improvement
  Components: wal, zookeeper
Reporter: Prakash Khemani
Assignee: Prakash Khemani
 Attachments: 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch


 Currently, once a watch fires indicating that the task node has been updated 
 (heartbeated) by the worker, the SplitLogManager still takes quite some time 
 before it updates the last-heard-from time. This is because the manager 
 schedules another getDataSetWatch() and updates the task's last-heard-from 
 time only after that call finishes.
 This leads to a large number of zk-BadVersion warnings when resubmission is 
 attempted repeatedly and keeps failing.
 Two changes should be made:
 (1) On a resubmission failure caused by BadVersion, the task's lastUpdate 
 time should be advanced.
 (2) The task's lastUpdate time should be advanced as soon as the 
 nodeDataChanged() watch fires, without waiting for getDataSetWatch() to 
 complete.





[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits

2012-04-03 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5618:
---

Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch

patch without directory prefixes

 SplitLogManager - prevent unnecessary attempts to resubmits
 ---

 Key: HBASE-5618
 URL: https://issues.apache.org/jira/browse/HBASE-5618
 Project: HBase
  Issue Type: Improvement
  Components: wal, zookeeper
Reporter: Prakash Khemani
Assignee: Prakash Khemani
 Attachments: 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch, 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch


 Currently, once a watch fires indicating that the task node has been updated 
 (heartbeated) by the worker, the SplitLogManager still takes quite some time 
 before it updates the last-heard-from time. This is because the manager 
 schedules another getDataSetWatch() and updates the task's last-heard-from 
 time only after that call finishes.
 This leads to a large number of zk-BadVersion warnings when resubmission is 
 attempted repeatedly and keeps failing.
 Two changes should be made:
 (1) On a resubmission failure caused by BadVersion, the task's lastUpdate 
 time should be advanced.
 (2) The task's lastUpdate time should be advanced as soon as the 
 nodeDataChanged() watch fires, without waiting for getDataSetWatch() to 
 complete.





[jira] [Updated] (HBASE-5606) SplitLogManger async delete node hangs log splitting when ZK connection is lost

2012-03-26 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5606:
---

Attachment: 0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch

Do not do any error processing if the getDataSetWatch() call from 
SplitLogManager timeoutMonitor fails

 SplitLogManger async delete node hangs log splitting when ZK connection is 
 lost 
 

 Key: HBASE-5606
 URL: https://issues.apache.org/jira/browse/HBASE-5606
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0
Reporter: Gopinathan A
Priority: Critical
 Fix For: 0.92.2

 Attachments: 
 0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch, 5606.txt


 1. One RS died; the ServerShutdownHandler detected it and started 
 distributed log splitting.
 2. All tasks failed because the ZK connection was lost, so all the tasks were 
 deleted asynchronously.
 3. The ServerShutdownHandler retried the log splitting.
 4. The asynchronous deletion from step 2 finally happened for the new task.
 5. This left the SplitLogManager in a hanging state.
 As a result, the .META. region was not assigned for a long time.
 {noformat}
 hbase-root-master-HOST-192-168-47-204.log.2012-03-14(55413,79):2012-03-14 
 19:28:47,932 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up 
 splitlog task at znode 
 /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
 hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89303,79):2012-03-14 
 19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up 
 splitlog task at znode 
 /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
 {noformat}
 {noformat}
 hbase-root-master-HOST-192-168-47-204.log.2012-03-14(80417,99):2012-03-14 
 19:34:31,196 DEBUG 
 org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted 
 /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
 hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89456,99):2012-03-14 
 19:34:32,497 DEBUG 
 org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted 
 /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
 {noformat}





[jira] [Updated] (HBASE-5618) SplitLogManager - prevent unnecessary attempts to resubmits

2012-03-26 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5618:
---

Attachment: 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch

update heartbeat time as soon as possible and as often as one can.

 SplitLogManager - prevent unnecessary attempts to resubmits
 ---

 Key: HBASE-5618
 URL: https://issues.apache.org/jira/browse/HBASE-5618
 Project: HBase
  Issue Type: Improvement
  Components: wal, zookeeper
Reporter: Prakash Khemani
 Attachments: 
 0001-HBASE-5618-SplitLogManager-prevent-unnecessary-attem.patch


 Currently, once a watch fires indicating that the task node has been updated 
 (heartbeated) by the worker, the SplitLogManager still takes quite some time 
 before it updates the last-heard-from time. This is because the manager 
 schedules another getDataSetWatch() and updates the task's last-heard-from 
 time only after that call finishes.
 This leads to a large number of zk-BadVersion warnings when resubmission is 
 attempted repeatedly and keeps failing.
 Two changes should be made:
 (1) On a resubmission failure caused by BadVersion, the task's lastUpdate 
 time should be advanced.
 (2) The task's lastUpdate time should be advanced as soon as the 
 nodeDataChanged() watch fires, without waiting for getDataSetWatch() to 
 complete.





[jira] [Updated] (HBASE-5519) Incorrect warning in splitlogmanager

2012-03-05 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5519:
---

Attachment: 0001-HBASE-5519-Incorrect-warning-in-splitlogmanager.patch

Replace a log.warn() with a comment.

 Incorrect warning in splitlogmanager
 

 Key: HBASE-5519
 URL: https://issues.apache.org/jira/browse/HBASE-5519
 Project: HBase
  Issue Type: Improvement
Reporter: Prakash Khemani
 Attachments: 
 0001-HBASE-5519-Incorrect-warning-in-splitlogmanager.patch


 Because of recently added behavior, where the SplitLogManager timeout thread 
 gets data from the ZK node just to check that the node is there, we might 
 have multiple watches firing without the task znode expiring.
 Remove the incorrect warning message. (Internally, an assert failed in 
 Mikhail's tests.)





[jira] [Updated] (HBASE-5287) [89-fb] hbck can go into an infinite loop

2012-02-02 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5287:
---

Summary: [89-fb] hbck can go into an infinite loop  (was: hbck can go into 
an infinite loop)

will close this issue. Mikhail will be uploading the patch separately.

 [89-fb] hbck can go into an infinite loop
 -

 Key: HBASE-5287
 URL: https://issues.apache.org/jira/browse/HBASE-5287
 Project: HBase
  Issue Type: Bug
Reporter: Prakash Khemani

 HBaseFsckRepair.prompt() should check for -1 return value from 
 System.in.read()
 Only affects 0.89 release.
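
 The kind of check described above can be sketched as follows. This is not 
 the HBaseFsckRepair code: PromptSketch and promptYesNo() are hypothetical 
 names, and the y/n protocol is an assumption made for illustration. The 
 point is only that InputStream.read() returns -1 at end of stream, and a 
 loop that ignores it never terminates.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical prompt loop; names and the y/n protocol are assumptions. */
class PromptSketch {
    /** Returns true on 'y'/'Y', false on 'n'/'N' or when the input ends. */
    static boolean promptYesNo(InputStream in) {
        try {
            while (true) {
                int c = in.read();
                if (c == -1) {
                    return false;  // end of stream: stop instead of looping forever
                }
                if (c == 'y' || c == 'Y') {
                    return true;
                }
                if (c == 'n' || c == 'N') {
                    return false;
                }
                // Any other character (for example a newline) is skipped.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

 Calling promptYesNo(System.in) with standard input closed or redirected 
 from an empty file returns false immediately rather than spinning.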





[jira] [Updated] (HBASE-5287) hbck can go into an infinite loop

2012-01-27 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5287:
---

Summary: hbck can go into an infinite loop  (was: fsync can go into an 
infinite loop)

 hbck can go into an infinite loop
 -

 Key: HBASE-5287
 URL: https://issues.apache.org/jira/browse/HBASE-5287
 Project: HBase
  Issue Type: Bug
Reporter: Prakash Khemani

 HBaseFsckRepair.prompt() should check for -1 return value from 
 System.in.read()
 Only affects 0.89 release.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-05 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

a test added on top of Ted's last change.

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 5081-deleteNode-with-while-loop.txt, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


 Recently, during 0.92 RC testing, we found that distributed log splitting 
 hangs forever.  Please see the attached screenshot.
 I looked into it, and here is what I think happened:
 1. One RS died; the ServerShutdownHandler detected it and started 
 distributed log splitting.
 2. All three tasks failed, so the three tasks were deleted asynchronously.
 3. The ServerShutdownHandler retried the log splitting.
 4. During the retry, it created these three tasks again and put them in a 
 hashmap (tasks).
 5. The asynchronous deletion from step 2 finally happened for one task; in 
 the callback, it removed one task from the hashmap.
 6. The ZooKeeper watcher of one of the newly submitted tasks found that the 
 task was unassigned and not in the hashmap, so it created a new orphan task.
 7. All three tasks failed, but the task created in step 6 was an orphan, so 
 the batch.err counter was one short. Log splitting therefore hangs, waiting 
 for a last task to finish, which is never going to happen.
 So I think the problem is step 2.  The fix is to make deletion synchronous 
 instead of asynchronous, so that the retry has a clean start.
 An asynchronous deleteNode interferes with the split log retry.  In an extreme 
 case, if the async deleteNode does not happen soon enough, a node created 
 during the retry could be deleted.
 deleteNode should be synchronous.
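
 The interleaving in steps 2 to 5 above can be reproduced with a small, 
 deterministic Java model. It is a sketch only: DeleteRaceSketch and its 
 methods are hypothetical, there is no ZooKeeper involved, and a plain queue 
 stands in for delete callbacks that have been requested but have not yet 
 run.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical single-threaded model; the queue holds callbacks not yet run. */
class DeleteRaceSketch {
    private final Set<String> tasks = new HashSet<>();           // in-memory task map
    private final Queue<Runnable> pending = new ArrayDeque<>();  // delayed callbacks

    void submit(String path) {
        tasks.add(path);
    }

    boolean isTracked(String path) {
        return tasks.contains(path);
    }

    /** Asynchronous delete: the removal happens whenever the callback runs. */
    void deleteAsync(String path) {
        pending.add(() -> tasks.remove(path));
    }

    /** Synchronous delete: the removal is finished before the call returns. */
    void deleteSync(String path) {
        tasks.remove(path);
    }

    void runPendingCallbacks() {
        Runnable r;
        while ((r = pending.poll()) != null) {
            r.run();
        }
    }

    /** Replays fail, delete, retry, late callback; reports whether the retry survives. */
    static boolean retriedTaskSurvives(boolean syncDelete) {
        DeleteRaceSketch mgr = new DeleteRaceSketch();
        String path = "/hbase/splitlog/task-1";  // hypothetical task path
        mgr.submit(path);
        if (syncDelete) {
            mgr.deleteSync(path);
        } else {
            mgr.deleteAsync(path);
        }
        mgr.submit(path);           // the retry re-creates the same task
        mgr.runPendingCallbacks();  // a late callback fires after the retry
        return mgr.isTracked(path);
    }
}
```

 With the asynchronous variant the late callback removes the entry that the 
 retry just added, so the retried task is no longer tracked; with the 
 synchronous variant nothing is left pending and the retry starts clean.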





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-04 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

implement feedback

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


 Recently, during 0.92 RC testing, we found that distributed log splitting 
 hangs forever.  Please see the attached screenshot.
 I looked into it, and here is what I think happened:
 1. One RS died; the ServerShutdownHandler detected it and started 
 distributed log splitting.
 2. All three tasks failed, so the three tasks were deleted asynchronously.
 3. The ServerShutdownHandler retried the log splitting.
 4. During the retry, it created these three tasks again and put them in a 
 hashmap (tasks).
 5. The asynchronous deletion from step 2 finally happened for one task; in 
 the callback, it removed one task from the hashmap.
 6. The ZooKeeper watcher of one of the newly submitted tasks found that the 
 task was unassigned and not in the hashmap, so it created a new orphan task.
 7. All three tasks failed, but the task created in step 6 was an orphan, so 
 the batch.err counter was one short. Log splitting therefore hangs, waiting 
 for a last task to finish, which is never going to happen.
 So I think the problem is step 2.  The fix is to make deletion synchronous 
 instead of asynchronous, so that the retry has a clean start.
 An asynchronous deleteNode interferes with the split log retry.  In an extreme 
 case, if the async deleteNode does not happen soon enough, a node created 
 during the retry could be deleted.
 deleteNode should be synchronous.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-04 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

set the interrupt flag on InterruptedException. Remove OrphanLogException 
handling for distributed log splitting.
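
 For readers unfamiliar with the first change mentioned above, "setting the 
 interrupt flag" usually refers to the standard Java idiom below. The class 
 and method names are hypothetical and the snippet is not taken from the 
 patch; it only shows the idiom in isolation.

```java
/** Hypothetical helper illustrating the restore-the-interrupt-flag idiom. */
class InterruptSketch {
    /** Sleeps for up to millis; returns false if the thread was interrupted. */
    static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Thread.sleep() clears the flag when it throws, so set it again
            // to let callers further up the stack see the interruption.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

 Without the call to interrupt(), code higher up that polls 
 Thread.currentThread().isInterrupted() would never learn that a shutdown 
 was requested.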

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


 Recently, during 0.92 RC testing, we found that distributed log splitting 
 hangs forever.  Please see the attached screenshot.
 I looked into it, and here is what I think happened:
 1. One RS died; the ServerShutdownHandler detected it and started 
 distributed log splitting.
 2. All three tasks failed, so the three tasks were deleted asynchronously.
 3. The ServerShutdownHandler retried the log splitting.
 4. During the retry, it created these three tasks again and put them in a 
 hashmap (tasks).
 5. The asynchronous deletion from step 2 finally happened for one task; in 
 the callback, it removed one task from the hashmap.
 6. The ZooKeeper watcher of one of the newly submitted tasks found that the 
 task was unassigned and not in the hashmap, so it created a new orphan task.
 7. All three tasks failed, but the task created in step 6 was an orphan, so 
 the batch.err counter was one short. Log splitting therefore hangs, waiting 
 for a last task to finish, which is never going to happen.
 So I think the problem is step 2.  The fix is to make deletion synchronous 
 instead of asynchronous, so that the retry has a clean start.
 An asynchronous deleteNode interferes with the split log retry.  In an extreme 
 case, if the async deleteNode does not happen soon enough, a node created 
 during the retry could be deleted.
 deleteNode should be synchronous.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-03 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

 Distributed log splitting deleteNode races against splitLog retry 
 ---

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


 Recently, during 0.92 RC testing, we found that distributed log splitting 
 hangs forever.  Please see the attached screenshot.
 I looked into it, and here is what I think happened:
 1. One RS died; the ServerShutdownHandler detected it and started 
 distributed log splitting.
 2. All three tasks failed, so the three tasks were deleted asynchronously.
 3. The ServerShutdownHandler retried the log splitting.
 4. During the retry, it created these three tasks again and put them in a 
 hashmap (tasks).
 5. The asynchronous deletion from step 2 finally happened for one task; in 
 the callback, it removed one task from the hashmap.
 6. The ZooKeeper watcher of one of the newly submitted tasks found that the 
 task was unassigned and not in the hashmap, so it created a new orphan task.
 7. All three tasks failed, but the task created in step 6 was an orphan, so 
 the batch.err counter was one short. Log splitting therefore hangs, waiting 
 for a last task to finish, which is never going to happen.
 So I think the problem is step 2.  The fix is to make deletion synchronous 
 instead of asynchronous, so that the retry has a clean start.
 An asynchronous deleteNode interferes with the split log retry.  In an extreme 
 case, if the async deleteNode does not happen soon enough, a node created 
 during the retry could be deleted.
 deleteNode should be synchronous.





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-03 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

I rebased and it rebased cleanly for me. Anyway, uploading the format-patch 
output again to see if it applies. 

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt


 Recently, during 0.92 RC testing, we found that distributed log splitting 
 can hang forever.  Please see the attached screenshot.
 I looked into it and here is what I think happened:
 1. One region server died; the ServerShutdownHandler detected it and started 
 distributed log splitting.
 2. All three tasks failed, so the three tasks were deleted asynchronously.
 3. The ServerShutdownHandler retried the log splitting.
 4. During the retry, it created the three tasks again and put them in a 
 hashmap (tasks).
 5. The asynchronous deletion from step 2 finally happened for one task; its 
 callback removed that task from the hashmap.
 6. The ZooKeeper watcher of one of the newly submitted tasks found that the 
 task was unassigned and not in the hashmap, so it created a new orphan task.
 7. All three tasks failed, but the task created in step 6 was an orphan, so 
 the batch.err counter was one short. The log splitting therefore hangs, 
 waiting for a last task that is never going to finish.
 So I think the problem is step 2.  The fix is to make the deletion 
 synchronous instead of asynchronous, so that the retry gets a clean start.
 An async deleteNode interferes with the split log retry.  In an extreme 
 case, if the async deleteNode does not happen soon enough, a node created 
 during the retry could be deleted.
 deleteNode should be sync.
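
For illustration only, the interleaving described in the steps above can be modelled in a few lines of Java. This is a toy model under stated assumptions, not the SplitLogManager code: the class, its methods and the explicit callback queue are hypothetical, and for simplicity all three delete callbacks (not just one) are assumed to run late.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the race; every name here is hypothetical, not an HBase API. */
class DeleteNodeRace {
    /** Stand-in for the manager's task map (task path -> attempt number). */
    final Map<String, Integer> tasks = new HashMap<>();
    /** Delete callbacks that were requested but have not run yet. */
    final Queue<Runnable> pendingCallbacks = new ArrayDeque<>();

    void createTask(String path, int attempt) { tasks.put(path, attempt); }

    /** Async flavour: the map entry is removed later, when the callback runs. */
    void deleteAsync(String path) { pendingCallbacks.add(() -> tasks.remove(path)); }

    /** Sync flavour: the map entry is gone before the method returns. */
    void deleteSync(String path) { tasks.remove(path); }

    void runPendingCallbacks() {
        Runnable r;
        while ((r = pendingCallbacks.poll()) != null) r.run();
    }

    /** Returns how many retried tasks the map still tracks after the retry. */
    static int tasksTrackedAfterRetry(boolean syncDelete) {
        DeleteNodeRace m = new DeleteNodeRace();
        String[] paths = {"log1", "log2", "log3"};
        for (String p : paths) m.createTask(p, 1);   // first attempt
        for (String p : paths) {                     // all tasks failed: delete them
            if (syncDelete) m.deleteSync(p); else m.deleteAsync(p);
        }
        for (String p : paths) m.createTask(p, 2);   // retry re-creates the tasks
        m.runPendingCallbacks();                     // late delete callbacks fire now
        return m.tasks.size();
    }

    public static void main(String[] args) {
        System.out.println("async: " + tasksTrackedAfterRetry(false)); // async: 0
        System.out.println("sync: " + tasksTrackedAfterRetry(true));   // sync: 3
    }
}
```

With the async flavour the late callbacks wipe out entries that belong to the retry, which is the state in which a watcher would treat the task as an orphan. With the sync flavour the retry starts from an empty map and keeps all three entries.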





[jira] [Updated] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry

2012-01-03 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5081:
---

Attachment: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch

fix an (unrelated) test failure in TestSplitLogManager.testRescanCleanup()

 Distributed log splitting deleteNode races against splitLog retry 
 --

 Key: HBASE-5081
 URL: https://issues.apache.org/jira/browse/HBASE-5081
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Prakash Khemani
 Fix For: 0.92.0

 Attachments: 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 
 distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, 
 hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, 
 hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, 
 patch_for_92_v3.txt







[jira] [Updated] (HBASE-5029) TestDistributedLogSplitting fails on occasion

2011-12-16 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-5029:
---

Attachment: 0001-HBASE-5029-jira-TestDistributedLogSplitting-fails-on.patch

 TestDistributedLogSplitting fails on occasion
 -

 Key: HBASE-5029
 URL: https://issues.apache.org/jira/browse/HBASE-5029
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Prakash Khemani
 Attachments: 
 0001-HBASE-5029-jira-TestDistributedLogSplitting-fails-on.patch, 
 HBASE-5029.D891.1.patch, HBASE-5029.D891.2.patch


 This is how it usually fails: 
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/lastCompletedBuild/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testWorkerAbort/
 Assigning mighty Prakash since he offered to take a looksee.





[jira] [Updated] (HBASE-4721) Retain Delete Markers after Major Compaction

2011-11-07 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-4721:
---

Summary: Retain Delete Markers after Major Compaction  (was: Configurable 
TTL for Delete Markers)

I think all we need to do is to retain delete markers even after a major 
compaction.

Retaining delete markers beyond the TTL of regular KVs doesn't make sense to 
me. Say the regular KV TTL is 1 day. If we keep a delete marker alive for 2 
days, then by the time it is 2 days old the only KVs it can still affect are 
at least 2 days old and have already expired.

If we just retain delete markers after every major compaction, then the 
processing becomes exactly the same as in a minor compaction. Hopefully that 
will reduce some code complexity.
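
The TTL argument can be checked with a small sketch. It assumes, as a simplification, that a KV is live while its age is at most the TTL and that a delete marker masks every KV with a timestamp at or before its own; the class and method names are hypothetical and timestamps are in hours.

```java
/** Sketch of the TTL argument; names and the expiry rule are assumptions. */
class DeleteMarkerTtl {
    /** Assumed rule: a KV is live while its age does not exceed the TTL. */
    static boolean isLive(long kvTs, long now, long ttl) {
        return now - kvTs <= ttl;
    }

    /** True if the marker still masks at least one live KV at time 'now'. */
    static boolean markerStillMatters(long markerTs, long[] kvTimestamps,
                                      long now, long ttl) {
        for (long ts : kvTimestamps) {
            // Assumed rule: the marker masks KVs written at or before markerTs.
            if (ts <= markerTs && isLive(ts, now, ttl)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long ttl = 24;                 // KV TTL of 1 day, in hours
        long markerTs = 100;           // delete marker written at hour 100
        long[] kvs = {90, 100};        // KVs the marker could mask
        // Marker is 10 hours old: it still hides live KVs.
        System.out.println(markerStillMatters(markerTs, kvs, 110, ttl)); // true
        // Marker is 30 hours old: everything it masks has already expired.
        System.out.println(markerStillMatters(markerTs, kvs, 130, ttl)); // false
    }
}
```

Once the marker is older than the KV TTL, every KV it could mask is older still, so keeping the marker longer changes nothing for readers.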

 Retain Delete Markers after Major Compaction
 

 Key: HBASE-4721
 URL: https://issues.apache.org/jira/browse/HBASE-4721
 Project: HBase
  Issue Type: New Feature
Reporter: Prakash Khemani
Assignee: Prakash Khemani

 There is a need to provide long TTLs for delete markers. This is useful when 
 replicating hbase logs from one cluster to another. The receiving cluster 
 shouldn't compact away the delete markers because the affected key-values 
 might still be on the way.
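
A minimal sketch of the replication concern, assuming a single row/column and a much simplified store: the delete arrives at the receiving cluster before the older put it is meant to mask. All class and method names are hypothetical, not HBase APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy single-column store; a simplification, not HBase's storage model. */
class ReplicationOrderSketch {
    static final class Cell {
        final long ts;
        final boolean isDelete;
        Cell(long ts, boolean isDelete) { this.ts = ts; this.isDelete = isDelete; }
    }

    final List<Cell> cells = new ArrayList<>();

    void apply(long ts, boolean isDelete) { cells.add(new Cell(ts, isDelete)); }

    /** Timestamp of the newest delete marker, or Long.MIN_VALUE if none. */
    long newestDelete() {
        long d = Long.MIN_VALUE;
        for (Cell c : cells) if (c.isDelete) d = Math.max(d, c.ts);
        return d;
    }

    /** Major compaction: drop masked puts, and drop markers unless retained. */
    void majorCompact(boolean retainDeleteMarkers) {
        long d = newestDelete();
        cells.removeIf(c -> c.isDelete ? !retainDeleteMarkers : c.ts <= d);
    }

    /** A put is visible only if it is newer than every delete marker. */
    boolean anyPutVisible() {
        long d = newestDelete();
        for (Cell c : cells) if (!c.isDelete && c.ts > d) return true;
        return false;
    }

    /** Returns true if the late put comes back to life on the receiver. */
    static boolean putResurrected(boolean retainDeleteMarkers) {
        ReplicationOrderSketch s = new ReplicationOrderSketch();
        s.apply(20, true);                   // delete (ts=20) is replicated first
        s.majorCompact(retainDeleteMarkers); // compaction runs in between
        s.apply(10, false);                  // the older put (ts=10) arrives late
        return s.anyPutVisible();
    }

    public static void main(String[] args) {
        System.out.println(putResurrected(false)); // true: marker was compacted away
        System.out.println(putResurrected(true));  // false: marker still masks the put
    }
}
```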





[jira] [Updated] (HBASE-4696) HRegionThriftServer' might have to indefinitely do redirtects

2011-10-28 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-4696:
---

Environment: 
HRegionThriftServer.getRowWithColumnsTs() redirects the request to the correct 
region server if it has landed on the wrong region-server. With this approach 
the smart-client will never get a NotServingRegionException and it will never 
be able to invalidate its cache. It will indefinitely send the request to the 
wrong region server and the wrong region server will always be redirecting it.

Either redirects should be turned off and the client should react to 
NotServingRegionExceptions.

Or somehow a flag should be set in the response telling the client to refresh 
its cache.
Summary: HRegionThriftServer' might have to indefinitely do redirtects  
(was: HRegionThriftServer)

 HRegionThriftServer' might have to indefinitely do redirtects
 -

 Key: HBASE-4696
 URL: https://issues.apache.org/jira/browse/HBASE-4696
 Project: HBase
  Issue Type: Bug
 Environment: HRegionThriftServer.getRowWithColumnsTs() redirects the 
 request to the correct region server if it has landed on the wrong 
 region-server. With this approach the smart-client will never get a 
 NotServingRegionException and it will never be able to invalidate its cache. 
 It will indefinitely send the request to the wrong region server and the 
 wrong region server will always be redirecting it.
 Either redirects should be turned off and the client should react to 
 NotServingRegionExceptions.
 Or somehow a flag should be set in the response telling the client to refresh 
 its cache.
Reporter: Prakash Khemani







[jira] [Updated] (HBASE-4696) HRegionThriftServer' might have to indefinitely do redirtects

2011-10-28 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-4696:
---

Description: 
HRegionThriftServer.getRowWithColumnsTs() redirects the request to the correct 
region server if it has landed on the wrong region-server. With this approach 
the smart-client will never get a NotServingRegionException and it will never 
be able to invalidate its cache. It will indefinitely send the request to the 
wrong region server and the wrong region server will always be redirecting it.

Either redirects should be turned off and the client should react to 
NotServingRegionExceptions.

Or somehow a flag should be set in the response telling the client to refresh 
its cache.
Environment: (was: HRegionThriftServer.getRowWithColumnsTs() redirects 
the request to the correct region server if it has landed on the wrong 
region-server. With this approach the smart-client will never get a 
NotServingRegionException and it will never be able to invalidate its cache. It 
will indefinitely send the request to the wrong region server and the wrong 
region server will always be redirecting it.

Either redirects should be turned off and the client should react to 
NotServingRegionExceptions.

Or somehow a flag should be set in the response telling the client to refresh 
its cache.)

 HRegionThriftServer' might have to indefinitely do redirtects
 -

 Key: HBASE-4696
 URL: https://issues.apache.org/jira/browse/HBASE-4696
 Project: HBase
  Issue Type: Bug
Reporter: Prakash Khemani

 HRegionThriftServer.getRowWithColumnsTs() redirects the request to the 
 correct region server if it has landed on the wrong region-server. With this 
 approach the smart-client will never get a NotServingRegionException and it 
 will never be able to invalidate its cache. It will indefinitely send the 
 request to the wrong region server and the wrong region server will always be 
 redirecting it.
 Either redirects should be turned off and the client should react to 
 NotServingRegionExceptions.
 Or somehow a flag should be set in the response telling the client to refresh 
 its cache.
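
The difference between the two behaviours can be sketched as follows. This is a simplified model under stated assumptions, not the HRegionThriftServer code: the exception class is a stand-in for NotServingRegionException, and the maps, server names and methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of redirect vs. reject; all names here are hypothetical. */
class RedirectSketch {
    /** Stand-in for HBase's NotServingRegionException. */
    static class NotServingRegion extends Exception {}

    /** Authoritative region -> server assignment. */
    static final Map<String, String> assignment = new HashMap<>();
    /** The smart client's possibly stale copy of the assignment. */
    static final Map<String, String> clientCache = new HashMap<>();
    static int redirects = 0;

    /** Server side: forward a misdirected request, or reject it. */
    static String serve(String server, String region, boolean redirect)
            throws NotServingRegion {
        String owner = assignment.get(region);
        if (server.equals(owner)) return "row@" + server;
        if (!redirect) throw new NotServingRegion();
        redirects++;                      // the client never hears about this
        return serve(owner, region, redirect);
    }

    /** Client side: the cache is refreshed only when an exception arrives. */
    static String get(String region, boolean redirect) {
        try {
            return serve(clientCache.get(region), region, redirect);
        } catch (NotServingRegion e) {
            clientCache.put(region, assignment.get(region)); // invalidate + reload
            try {
                return serve(clientCache.get(region), region, redirect);
            } catch (NotServingRegion e2) {
                throw new IllegalStateException(e2);
            }
        }
    }

    /** Counts server-side redirects for a client that starts with a stale cache. */
    static int redirectsAfter(int requests, boolean redirect) {
        assignment.clear(); clientCache.clear(); redirects = 0;
        assignment.put("r1", "rsB");      // the region has moved to rsB
        clientCache.put("r1", "rsA");     // the client still believes rsA
        for (int i = 0; i < requests; i++) get("r1", redirect);
        return redirects;
    }

    public static void main(String[] args) {
        System.out.println(redirectsAfter(5, true));  // 5: every request is redirected
        System.out.println(redirectsAfter(5, false)); // 0: one exception fixes the cache
    }
}
```

With redirects on, the stale cache entry is never corrected and every request pays the extra hop. With redirects off, the first request fails once, the client reloads its cache, and later requests go straight to the right server.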





[jira] [Updated] (HBASE-4687) regionserver may miss zk-heartbeats to master when replaying edits at region open

2011-10-28 Thread Prakash Khemani (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Khemani updated HBASE-4687:
---

Attachment: 0001-HBASE-4687-regionserver-may-miss-zk-heartbeats-to-ma.patch

Patch attached.

 regionserver may miss zk-heartbeats to master when replaying edits at region 
 open
 -

 Key: HBASE-4687
 URL: https://issues.apache.org/jira/browse/HBASE-4687
 Project: HBase
  Issue Type: Bug
Reporter: Prakash Khemani
Assignee: Prakash Khemani
 Attachments: 
 0001-HBASE-4687-regionserver-may-miss-zk-heartbeats-to-ma.patch


 replayRecoveredEdits() should do another reporter.progress() before returning.
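
A rough sketch of the idea, assuming the replay loop reports progress every so many edits: the interface, the interval parameter and the loop shape are hypothetical and only illustrate why one more report before returning closes the gap after the last in-loop report.

```java
/** Toy replay loop; not the real replayRecoveredEdits(), names are hypothetical. */
class ReplayProgressSketch {
    /** Stand-in for the reporter whose progress() the description mentions. */
    interface Progress { void progress(); }

    /** Applies 'edits' edits, reporting every 'interval' edits and once at the end. */
    static int replay(int edits, int interval, Progress reporter) {
        int applied = 0;
        for (int i = 0; i < edits; i++) {
            applied++;                       // pretend to apply one edit
            if (applied % interval == 0) reporter.progress();
        }
        // Final report: covers the edits applied since the last in-loop report.
        reporter.progress();
        return applied;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int applied = replay(5, 2, () -> calls[0]++);
        System.out.println(applied + " edits, " + calls[0] + " reports"); // 5 edits, 3 reports
    }
}
```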
