I'm using 0.94.7 since I did not get the chance to deploy the last RC... I will wait for some more feedback regarding the options (delete or rename) and will most probably open a JIRA.
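To illustrate the rename option: the idea is to set the leftover aside with a timestamp instead of deleting it, so the edits could still be replayed later if they turn out to be needed. A minimal sketch on a local scratch directory (on the real cluster the same move would be done with `bin/hadoop fs -mv` against the region directory in HDFS; all paths here are made up for illustration):

```shell
# Sketch of the "rename instead of delete" option, demonstrated locally.
region_dir=$(mktemp -d)                 # stands in for the region dir in HDFS
touch "$region_dir/recovered.edits"     # stands in for the leftover file

# Move the leftover aside with a timestamp rather than deleting it,
# so it can still be replayed later if really required.
if [ -f "$region_dir/recovered.edits" ]; then
  mv "$region_dir/recovered.edits" "$region_dir/recovered.edits.$(date +%s).orphan"
fi

ls "$region_dir"
```

The split can then proceed, and the `.orphan` file keeps the edits around in case they are needed.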
Regarding recovered.edits, I don't have this file anymore, but I just found another one which is blocking some other splits:

hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/
Found 5 items
drwxr-xr-x   - hbase supergroup         0 2013-06-01 16:56 /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/.oldlogs
-rw-r--r--   3 hbase supergroup       855 2013-06-01 16:56 /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/.regioninfo
drwxr-xr-x   - hbase supergroup         0 2013-06-01 17:48 /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/@
drwxr-xr-x   - hbase supergroup         0 2013-06-01 16:56 /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/a
-rw-r--r--   3 hbase supergroup      5375 2013-06-01 15:43 /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/recovered.edits

And the date/time seems to match the 2 power outages I faced yesterday...

JM

2013/6/2 Ted Yu <[email protected]>:
> Can you search for 1d44b0630ed7785106a87a2bd4993551/recovered.edits to see when it was created?
> Namenode log would be a good place to start with.
>
> bq. we can also rename it so if really required we can replay it later?
>
> The above is a better way of handling the situation.
>
> What version of HBase are you using?
>
> Cheers
>
> On Sun, Jun 2, 2013 at 8:09 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>
>> My HBase was in a bad state recently. HBCK did a slow but good job and everything is now almost stable. However, I still have one log split which is not working. Every minute, the SplitLogManager tries to split the log, fails, and retries. It's always the same file. It's assigned to different nodes, but they all fail, and it starts again and again.
>>
>> 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
>> 2013-06-02 10:44:20,298 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
>> 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: wait for status of task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 to change to DELETED
>> 2013-06-02 10:44:20,315 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> 2013-06-02 10:44:20,329 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> 2013-06-02 10:44:20,341 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
>> 2013-06-02 10:44:20,344 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 ver = 0
>> 2013-06-02 10:44:20,346 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 ver = 0
>> 2013-06-02 10:44:20,384 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 acquired by node1,60020,1370136472290
>> 2013-06-02 10:44:20,410 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 acquired by node4,60020,1370136467255
>> 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260384 last_version = 1 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 113 while the timeout is 300000
>> 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260410 last_version = 1 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000
>> 2013-06-02 10:44:20,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
>> 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 120 while the timeout is 300000
>> 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000
>> 2013-06-02 10:44:21,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
>> 2013-06-02 10:44:21,708 DEBUG org.apache.hadoop.hbase.master.ServerManager: REPORT: Server node7,60020,1370136467731 came back up, removed it from the dead servers list
>> 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 1120 while the timeout is 300000
>> 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 1087 while the timeout is 300000
>> 2013-06-02 10:44:22,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
>> 2013-06-02 10:44:23,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 2120 while the timeout is 300000
>> 2013-06-02 10:44:23,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 2087 while the timeout is 300000
>> 2013-06-02 10:44:23,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
>> 2013-06-02 10:44:24,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 3121 while the timeout is 300000
>> 2013-06-02 10:44:24,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 3088 while the timeout is 300000
>> 2013-06-02 10:44:24,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
>> 2013-06-02 10:44:25,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 4121 while the timeout is 300000
>> 2013-06-02 10:44:25,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 4088 while the timeout is 300000
>> 2013-06-02 10:44:25,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
>> 2013-06-02 10:44:26,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 5121 while the timeout is 300000
>> 2013-06-02 10:44:26,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 5088 while the timeout is 300000
>> 2013-06-02 10:44:26,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
>> 2013-06-02 10:45:20,551 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 entered state err node4,60020,1370136467255
>> 2013-06-02 10:45:20,551 WARN org.apache.hadoop.hbase.master.SplitLogManager: Error splitting /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
>> 2013-06-02 10:45:20,990 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 entered state err node1,60020,1370136472290
>> 2013-06-02 10:45:20,990 WARN org.apache.hadoop.hbase.master.SplitLogManager: Error splitting /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
>> 2013-06-02 10:45:20,990 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting] installed = 2 but only 0 done
>> 2013-06-02 10:45:20,990 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
>> java.io.IOException: failed log splitting for node7,60020,1370118961527, will retry
>>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:69)
>>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>         at java.lang.Thread.run(Thread.java:722)
>> Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting] Task = installed = 2 done = 0 error = 2
>>         at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:299)
>>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
>>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:341)
>>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:288)
>>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:63)
>>         ... 4 more
>> 2013-06-02 10:45:20,990 INFO org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler: Splitting all logs for node7,60020,1370118961527
>>
>> I tried to look at the file:
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/.logs/node7,60020,1370118961527-splitting/
>> Found 2 items
>> -rw-r--r--   3 hbase supergroup  41578360 2013-06-01 17:36 /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614
>> -rw-r--r--   3 hbase supergroup  18153457 2013-06-01 19:36 /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666
>>
>> I also tried to get the files from HDFS:
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614 .
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666 .
>> hadoop@node3:~/hadoop-1.0.3$ ll
>> -rw-r--r-- 1 hadoop hadoop 41578360 jun  2 10:13 node7%2C60020%2C1370118961527.1370122562614
>> -rw-r--r-- 1 hadoop hadoop 18153457 jun  2 10:13 node7%2C60020%2C1370118961527.1370129764666
>>
>> I looked into the code, and the issue is that "installed = 2 but only 0 done" (batch.done != batch.installed), which means we are coming back from "waitForSplittingCompletion" even though the split is not really completed. The "Error splitting" warning is logged from "SplitLogManager.setDone" but does not give much information to track the root cause.
>>
>> So I looked into the regionservers, and found this:
>> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1170)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
>>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:601)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
>>
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1070)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>>         at $Proxy8.create(Unknown Source)
>>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:601)
>>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>         at $Proxy8.create(Unknown Source)
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
>>         ... 19 more
>>
>> So I looked in HDFS and found this:
>> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/
>> Found 5 items
>> drwxr-xr-x   - hbase supergroup         0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.oldlogs
>> -rw-r--r--   3 hbase supergroup       841 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.regioninfo
>> drwxr-xr-x   - hbase supergroup         0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/@
>> drwxr-xr-x   - hbase supergroup         0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/a
>> -rw-r--r--   3 hbase supergroup      5159 2013-06-01 15:43 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>>
>> So, 2 things again here.
>>
>> 1) Should the region server send more information about the failure to the master, so that the master can display the failure cause in its logs?
>> 2) recovered.edits should not have been there. The failure is because the SplitLogWorker tries to create a recovered.edits file but it already exists.
>>
>> This happened because my cluster failed twice in a row yesterday, and most probably the second failure occurred while the log split was still in progress. Therefore, this file is just a leftover.
>>
>> Now, what should we do? I can manually delete this file and the process will take over and everything will be fine, but should we update HLogSplitter to validate that this file is not there first, and delete it if required? What's the risk of deleting this file automatically? There should not be any other process writing to it, right? Worst case, we can also rename it so that, if really required, we can replay it later?
>>
>> For my own needs, I removed the file manually and everything is running now (re-assigning regions, etc.).
>>
>> What's your opinion on that? Should we remove the file? Or should we not, because it's too dangerous?
>>
>> JM
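Since a second blocking recovered.edits file turned up in another region, a quick way to check for any remaining ones is to filter a recursive listing for recovered.edits entries that are plain files: the FileAlreadyExistsException above ("Parent path is not a directory") shows the split code expects a directory there, so a plain file with that name is the suspect leftover. A sketch of the filter, run against a hard-coded sample listing (taken from the paths in this thread) instead of a live `bin/hadoop fs -lsr /hbase`:

```shell
# Sketch: find leftover recovered.edits entries that are plain files.
# On the real cluster the listing would come from "bin/hadoop fs -lsr /hbase";
# a captured sample is used here so the filter itself can be demonstrated.
cat > listing.txt <<'EOF'
drwxr-xr-x   - hbase supergroup         0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.oldlogs
-rw-r--r--   3 hbase supergroup      5159 2013-06-01 15:43 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
-rw-r--r--   3 hbase supergroup      5375 2013-06-01 15:43 /hbase/work_proposed/4a45c6fed3578cd6f547c6fc58bad221/recovered.edits
EOF
# Keep only plain files (permission string starts with "-") whose path
# ends in recovered.edits; directories with that name are the normal case.
awk '$1 ~ /^-/ && $NF ~ /\/recovered\.edits$/' listing.txt
```

Each line printed is a region whose split would hit the same FileAlreadyExistsException and may need the same delete/rename treatment.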
