Can you search for 1d44b0630ed7785106a87a2bd4993551/recovered.edits to see when it was created? The namenode log would be a good place to start.
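As a sketch of that search (the namenode log location and the exact shape of the log lines vary by installation, so both are assumptions here, not something from this thread):

```shell
# Sketch: search a namenode log for entries that touch the leftover
# recovered.edits path, to see when (and by what operation) it was created.
# The default log path below is an assumption -- adjust it for your setup.
find_edits_creation() {
  nn_log=${1:-/var/log/hadoop/hadoop-namenode.log}   # assumed location
  grep '1d44b0630ed7785106a87a2bd4993551/recovered.edits' "$nn_log"
}
```

If HDFS audit logging is enabled, the matching lines should include the operation (e.g. a create) and a timestamp, which answers the "when was it created" question directly.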
bq. we can also rename it so if really required we can replay it later?

The above is a better way of handling the situation.

What version of HBase are you using?

Cheers

On Sun, Jun 2, 2013 at 8:09 AM, Jean-Marc Spaggiari <[email protected]> wrote:

> My HBase was in a bad state recently. HBCK did a slow but good job and everything is now almost stable. However, I still have one log split which is not working. Every minute, the SplitLogManager tries to split the log, fails, and retries. It is always the same file. It gets assigned to different nodes, but they all fail, and the cycle starts again and again.
>
> 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
> 2013-06-02 10:44:20,298 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
> 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: wait for status of task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 to change to DELETED
> 2013-06-02 10:44:20,315 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:44:20,329 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:44:20,341 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
> 2013-06-02 10:44:20,344 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 ver = 0
> 2013-06-02 10:44:20,346 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 ver = 0
> 2013-06-02 10:44:20,384 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 acquired by node1,60020,1370136472290
> 2013-06-02 10:44:20,410 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 acquired by node4,60020,1370136467255
> 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260384 last_version = 1 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 113 while the timeout is 300000
> 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260410 last_version = 1 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000
> 2013-06-02 10:44:20,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 120 while the timeout is 300000
> 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000
> 2013-06-02 10:44:21,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:21,708 DEBUG org.apache.hadoop.hbase.master.ServerManager: REPORT: Server node7,60020,1370136467731 came back up, removed it from the dead servers list
> 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 1120 while the timeout is 300000
> 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 1087 while the timeout is 300000
> 2013-06-02 10:44:22,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:23,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 2120 while the timeout is 300000
> 2013-06-02 10:44:23,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 2087 while the timeout is 300000
> 2013-06-02 10:44:23,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:24,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 3121 while the timeout is 300000
> 2013-06-02 10:44:24,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 3088 while the timeout is 300000
> 2013-06-02 10:44:24,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:25,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 4121 while the timeout is 300000
> 2013-06-02 10:44:25,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 4088 while the timeout is 300000
> 2013-06-02 10:44:25,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:26,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 5121 while the timeout is 300000
> 2013-06-02 10:44:26,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 5088 while the timeout is 300000
> 2013-06-02 10:44:26,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:45:20,551 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 entered state err node4,60020,1370136467255
> 2013-06-02 10:45:20,551 WARN org.apache.hadoop.hbase.master.SplitLogManager: Error splitting /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
> 2013-06-02 10:45:20,990 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 entered state err node1,60020,1370136472290
> 2013-06-02 10:45:20,990 WARN org.apache.hadoop.hbase.master.SplitLogManager: Error splitting /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:45:20,990 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting] installed = 2 but only 0 done
> 2013-06-02 10:45:20,990 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for node7,60020,1370118961527, will retry
>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:69)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting] Task = installed = 2 done = 0 error = 2
>         at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:299)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:341)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:288)
>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:63)
>         ... 4 more
> 2013-06-02 10:45:20,990 INFO org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler: Splitting all logs for node7,60020,1370118961527
>
> I tried to look at the file:
>
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/.logs/node7,60020,1370118961527-splitting/
> Found 2 items
> -rw-r--r--   3 hbase supergroup   41578360 2013-06-01 17:36 /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614
> -rw-r--r--   3 hbase supergroup   18153457 2013-06-01 19:36 /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666
>
> I also tried to get the files from HDFS:
>
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614 .
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666 .
> hadoop@node3:~/hadoop-1.0.3$ ll
> -rw-r--r-- 1 hadoop hadoop 41578360 jun  2 10:13 node7%2C60020%2C1370118961527.1370122562614
> -rw-r--r-- 1 hadoop hadoop 18153457 jun  2 10:13 node7%2C60020%2C1370118961527.1370129764666
>
> I looked into the code, and the issue is that "installed = 2 but only 0 done" (batch.done != batch.installed), which means we come back from "waitForSplittingCompletion" even though the split is not really completed.
> The "Error splitting" warning is logged from "SplitLogManager.setDone", but it does not give much information to track down the root cause.
>
> So I looked into the regionservers, and found this:
>
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1170)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
>
>         at org.apache.hadoop.ipc.Client.call(Client.java:1070)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>         at $Proxy8.create(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy8.create(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
>         ... 19 more
>
> So I looked in HDFS and found this:
>
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/
> Found 5 items
> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.oldlogs
> -rw-r--r--   3 hbase supergroup        841 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.regioninfo
> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/@
> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/a
> -rw-r--r--   3 hbase supergroup       5159 2013-06-01 15:43 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>
> So, two things again here.
>
> 1) Should the region server send more information about the failure to the master, so that the master can log the failure cause?
> 2) recovered.edits should not have been there. The failure happens because the SplitLogWorker tries to create a recovered.edits file, but it already exists.
>
> This happened because my cluster failed twice in a row yesterday, and most probably the second failure occurred while the log split was still in progress. Therefore, this file is just a leftover.
>
> Now, what should we do? I can manually delete this file and the process will take over and everything will be fine, but should we update HLogSplitter to check first whether this file exists, and delete it if required? What is the risk of deleting this file automatically? There should not be any other process writing to it, right? Worst case, we can also rename it so if really required we can replay it later?
>
> For my own needs, I removed the file manually and everything is running now (re-assigning regions, etc.).
>
> What's your opinion on that? Should we remove the file? Or should we not, because it's too dangerous?
>
> JM
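For anyone landing here with the same symptom, the "rename rather than delete" option discussed in the thread could be sketched like this. The wrapper function, the .orphan suffix, and the FS_CMD indirection are illustrative assumptions, not anything HBase ships; the recovered.edits path is the one from this thread:

```shell
# Sketch: move a leftover recovered.edits out of the way instead of deleting
# it, so the edits can still be replayed later if they turn out to be needed.
# FS_CMD defaults to the "hadoop fs" shell; it is a parameter only so this
# sketch can be exercised without a running cluster.
FS_CMD=${FS_CMD:-bin/hadoop fs}
sideline_edits() {
  src=$1
  # Confirm what is about to be moved, then rename with a timestamp suffix.
  $FS_CMD -ls "$src" || return 1
  $FS_CMD -mv "$src" "${src}.orphan.$(date +%s)"
}
# Usage, with the path from this thread:
# sideline_edits /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
```

Compared with `hadoop fs -rm`, this keeps the edits around in case they are ever needed for replay, which matches the "worst case" option raised above.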
