Can you search for 1d44b0630ed7785106a87a2bd4993551/recovered.edits to see when it was created? The namenode log would be a good place to start.
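As a sketch of that search (the namenode log location and the exact shape of the log lines vary by installation, so both are assumptions here, not something from this thread):

```shell
# Sketch: search a namenode log for entries that touch the leftover
# recovered.edits path, to see when (and by what operation) it was created.
# The default log path below is an assumption -- adjust it for your setup.
find_edits_creation() {
  nn_log=${1:-/var/log/hadoop/hadoop-namenode.log}   # assumed location
  grep '1d44b0630ed7785106a87a2bd4993551/recovered.edits' "$nn_log"
}
```

If HDFS audit logging is enabled, the matching lines should include the operation (e.g. a create) and a timestamp, which answers the "when was it created" question directly.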
bq. we can also rename it so if really required we can replay it later?

The above is a better way of handling the situation.

What version of HBase are you using?

Cheers

On Sun, Jun 2, 2013 at 8:09 AM, Jean-Marc Spaggiari <[email protected]> wrote:

> My HBase was in a bad state recently. HBCK did a slow but good job and everything is now almost stable. However, I still have one log split which is not working. Every minute, the SplitLogManager tries to split the log, fails, and retries. It is always the same file. It gets assigned to different nodes, but they all fail, and the cycle starts again and again.
>
> 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
> 2013-06-02 10:44:20,298 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting]
> 2013-06-02 10:44:20,298 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: wait for status of task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 to change to DELETED
> 2013-06-02 10:44:20,315 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:44:20,329 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:44:20,341 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
> 2013-06-02 10:44:20,344 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 ver = 0
> 2013-06-02 10:44:20,346 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: task not yet acquired /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 ver = 0
> 2013-06-02 10:44:20,384 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 acquired by node1,60020,1370136472290
> 2013-06-02 10:44:20,410 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 acquired by node4,60020,1370136467255
> 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260384 last_version = 1 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 113 while the timeout is 300000
> 2013-06-02 10:44:20,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184260410 last_version = 1 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000
> 2013-06-02 10:44:20,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 120 while the timeout is 300000
> 2013-06-02 10:44:21,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 87 while the timeout is 300000
> 2013-06-02 10:44:21,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:21,708 DEBUG org.apache.hadoop.hbase.master.ServerManager: REPORT: Server node7,60020,1370136467731 came back up, removed it from the dead servers list
> 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 1120 while the timeout is 300000
> 2013-06-02 10:44:22,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 1087 while the timeout is 300000
> 2013-06-02 10:44:22,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:23,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 2120 while the timeout is 300000
> 2013-06-02 10:44:23,497 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 2087 while the timeout is 300000
> 2013-06-02 10:44:23,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:24,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 3121 while the timeout is 300000
> 2013-06-02 10:44:24,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 3088 while the timeout is 300000
> 2013-06-02 10:44:24,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:25,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 4121 while the timeout is 300000
> 2013-06-02 10:44:25,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 4088 while the timeout is 300000
> 2013-06-02 10:44:25,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:44:26,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261377 last_version = 2 cur_worker_name = node1,60020,1370136472290 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node1,60020,1370136472290 is not marked as dead, we waited for 5121 while the timeout is 300000
> 2013-06-02 10:44:26,498 TRACE org.apache.hadoop.hbase.master.SplitLogManager: Skipping the resubmit of last_update = 1370184261410 last_version = 2 cur_worker_name = node4,60020,1370136467255 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 2 done = 0 error = 0 because the server node4,60020,1370136467255 is not marked as dead, we waited for 5088 while the timeout is 300000
> 2013-06-02 10:44:26,498 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 2 unassigned = 0
> 2013-06-02 10:45:20,551 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666 entered state err node4,60020,1370136467255
> 2013-06-02 10:45:20,551 WARN org.apache.hadoop.hbase.master.SplitLogManager: Error splitting /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370129764666
> 2013-06-02 10:45:20,990 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614 entered state err node1,60020,1370136472290
> 2013-06-02 10:45:20,990 WARN org.apache.hadoop.hbase.master.SplitLogManager: Error splitting /hbase/splitlog/hdfs%3A%2F%2Fnode3%3A9000%2Fhbase%2F.logs%2Fnode7%2C60020%2C1370118961527-splitting%2Fnode7%252C60020%252C1370118961527.1370122562614
> 2013-06-02 10:45:20,990 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting] installed = 2 but only 0 done
> 2013-06-02 10:45:20,990 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for node7,60020,1370118961527, will retry
>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:69)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://node3:9000/hbase/.logs/node7,60020,1370118961527-splitting] Task = installed = 2 done = 0 error = 2
>         at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:299)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:371)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:341)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitAllLogs(MasterFileSystem.java:288)
>         at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:63)
>         ... 4 more
> 2013-06-02 10:45:20,990 INFO org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler: Splitting all logs for node7,60020,1370118961527
>
> I tried to look at the file:
>
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/.logs/node7,60020,1370118961527-splitting/
> Found 2 items
> -rw-r--r--   3 hbase supergroup   41578360 2013-06-01 17:36 /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614
> -rw-r--r--   3 hbase supergroup   18153457 2013-06-01 19:36 /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666
>
> I also tried to get the files from HDFS:
>
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370122562614 .
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -get /hbase/.logs/node7,60020,1370118961527-splitting/node7%2C60020%2C1370118961527.1370129764666 .
> hadoop@node3:~/hadoop-1.0.3$ ll
> -rw-r--r-- 1 hadoop hadoop 41578360 jun  2 10:13 node7%2C60020%2C1370118961527.1370122562614
> -rw-r--r-- 1 hadoop hadoop 18153457 jun  2 10:13 node7%2C60020%2C1370118961527.1370129764666
>
> I looked into the code, and the issue is that "installed = 2 but only 0 done" (batch.done != batch.installed), which means we come back from "waitForSplittingCompletion" even though the split is not really completed.
> The "Error splitting" warning is logged from "SplitLogManager.setDone", but it does not give much information to track down the root cause.
>
> So I looked into the regionservers, and found this:
>
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1170)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
>
>         at org.apache.hadoop.ipc.Client.call(Client.java:1070)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>         at $Proxy8.create(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy8.create(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
>         ... 19 more
>
> So I looked in HDFS and found this:
>
> hadoop@node3:~/hadoop-1.0.3$ bin/hadoop fs -ls /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/
> Found 5 items
> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.oldlogs
> -rw-r--r--   3 hbase supergroup        841 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/.regioninfo
> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/@
> drwxr-xr-x   - hbase supergroup          0 2013-06-01 16:56 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/a
> -rw-r--r--   3 hbase supergroup       5159 2013-06-01 15:43 /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
>
> So, two things again here.
>
> 1) Should the region server send more information about the failure to the master, so that the master can log the failure cause?
> 2) recovered.edits should not have been there. The failure happens because the SplitLogWorker tries to create a recovered.edits file, but it already exists.
>
> This happened because my cluster failed twice in a row yesterday, and most probably the second failure occurred while the log split was still in progress. Therefore, this file is just a leftover.
>
> Now, what should we do? I can manually delete this file and the process will take over and everything will be fine, but should we update HLogSplitter to check first whether this file exists, and delete it if required? What is the risk of deleting this file automatically? There should not be any other process writing to it, right? Worst case, we can also rename it so if really required we can replay it later?
>
> For my own needs, I removed the file manually and everything is running now (re-assigning regions, etc.).
>
> What's your opinion on that? Should we remove the file? Or should we not, because it's too dangerous?
>
> JM
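For anyone landing here with the same symptom, the "rename rather than delete" option discussed in the thread could be sketched like this. The wrapper function, the .orphan suffix, and the FS_CMD indirection are illustrative assumptions, not anything HBase ships; the recovered.edits path is the one from this thread:

```shell
# Sketch: move a leftover recovered.edits out of the way instead of deleting
# it, so the edits can still be replayed later if they turn out to be needed.
# FS_CMD defaults to the "hadoop fs" shell; it is a parameter only so this
# sketch can be exercised without a running cluster.
FS_CMD=${FS_CMD:-bin/hadoop fs}
sideline_edits() {
  src=$1
  # Confirm what is about to be moved, then rename with a timestamp suffix.
  $FS_CMD -ls "$src" || return 1
  $FS_CMD -mv "$src" "${src}.orphan.$(date +%s)"
}
# Usage, with the path from this thread:
# sideline_edits /hbase/work_proposed/1d44b0630ed7785106a87a2bd4993551/recovered.edits
```

Compared with `hadoop fs -rm`, this keeps the edits around in case they are ever needed for replay, which matches the "worst case" option raised above.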
