The log says that the region server (dp13.abcd.com) tried to talk to the
region server dp7.abcd.com and the call timed out after 60 seconds, and that
happened during a split, which is pretty bad. As the log says:

org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Abort; we got
an error after point-of-no-return

So what happened to that machine?

I understand the logs can look opaque, but they usually give you some clue,
so please investigate the logs on that machine (dp7), and please don't post
them back here without analyzing them first.
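
For example (the log path is a guess based on a typical CDH layout, adjust
for your setup), something like this run on dp7 will pull out the minutes
around the timeout:

  grep -C 30 '2012-04-04 10:1' \
      /var/log/hbase/hbase-hbase-regionserver-dp7.abcd.com.log | less

Long GC pauses (messages along the lines of "We slept ...ms instead of
...ms"), HDFS errors, or the box swapping are the usual suspects.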

J-D

On Wed, Apr 4, 2012 at 8:09 PM, Qian Ye <[email protected]> wrote:
> Hi all:
>
> I'm using cdh3u3 (based on hbase-0.90.4 and hadoop-0.20.2), and my cluster
> contains about 15 servers. The data in HDFS is about 10T, and about half of
> it is in HBase. When I run a customized MapReduce job that does not need to
> scan the whole HBase table, everything is fine. However, when I tried to
> back up HBase tables with the Export tool provided by HBase, one Region
> Server went down and the backup MapReduce job failed. The logs of that
> region server are as follows:
>
>
> 2012-04-04 10:11:53,817 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running
> rollback/cleanup of failed split of
> dailylaunchindex,2012-03-10_4e045076431fe31e74000032_d645cc647e72c5f1cc1ff3c460dcd515,1333303778356.2262c07cfc672237e61aa6113e785f55.;
> Failed dp13.abcd.com
> ,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
> java.io.IOException: Failed dp13.abcd.com
> ,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
> at
> org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:297)
> at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:156)
> at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:87)
> Caused by: java.net.SocketTimeoutException: Call to
> dp7.abcd.com/10.18.10.60:60020 failed on socket timeout exception:
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=
> dp7.abcd.com/10.18.10.60:60020]
> at
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> at $Proxy9.put(Unknown Source)
> at
> org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:122)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1392)
> at
> org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:375)
> at
> org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:342)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while
> waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=
> dp7.abcd.com/10.18.10.60:60020]
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> at java.io.DataInputStream.readInt(DataInputStream.java:370)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
> *2012-04-04 10:11:53,821 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=dp13.abcd.com,60020,1333436117207, load=(requests=18470,
> regions=244, usedHeap=6108, maxHeap=7973): Abort; we got an error after
> point-of-no-return*
> 2012-04-04 10:11:53,821 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> requests=6015, regions=244, stores=244, storefiles=557,
> storefileIndexSize=414, memstoreSize=1792, compactionQueueSize=4,
> flushQueueSize=0, usedHeap=6156, maxHeap=7973, blockCacheSize=1335446112,
> blockCacheFree=336613152, blockCacheCount=20071,
> blockCacheHitCount=65577505, blockCacheMissCount=30264896,
> blockCacheEvictedCount=23463221, blockCacheHitRatio=68,
> blockCacheHitCachingRatio=73
> 2012-04-04 10:11:53,824 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Abort; we got
> an error after point-of-no-return
> 2012-04-04 10:11:53,824 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> regionserver60020.compactor exiting
> 2012-04-04 10:11:53,967 INFO
> org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> 2012-04-04 10:11:54,062 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> regionserver60020.cacheFlusher exiting
> 2012-04-04 10:11:54,837 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
> and client tried to access missing scanner -7174278054087519478
> 2012-04-04 10:11:54,951 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
> and client tried to access missing scanner 5883825799758583233
> 2012-04-04 10:11:55,224 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
> and client tried to access missing scanner 5800828333591092756
> 2012-04-04 10:11:55,261 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
> and client tried to access missing scanner 5153473163996089139
> 2012-04-04 10:11:55,332 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
> and client tried to access missing scanner 3494993576774767091
> 2012-04-04 10:11:55,684 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
> and client tried to access missing scanner -1265087592996306143
> 2012-04-04 10:11:55,849 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down
> and client tried to access missing scanner -7174278054087519478
> ...
> 2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 27 on 60020: exiting
> 2012-04-04 10:11:55,930 INFO
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to
> stop the worker thread
> 2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 25 on 60020: exiting
> 2012-04-04 10:11:55,933 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> *2012-04-04 10:11:55,933 WARN
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> inteurrpted while waiting for task, exiting*
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:485)
> at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:205)
> at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
> at java.lang.Thread.run(Thread.java:662)
> 2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 26 on 60020: exiting
> 2012-04-04 10:11:55,933 INFO
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> dp13.abcd.com,60020,1333436117207 exiting
>
>
> My questions are:
>
> 1. Can I tune some parameters to make the export MapReduce job work? (A
> sketch of the command I mean is below.)
> 2. Is there any other way to back up my HBase tables in this situation? I
> don't have another cluster, and I cannot stop serving requests when I need
> to back up the tables.
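>
> To be concrete about question 1, the Export command I mean is roughly the
> following sketch (the table name and output path are placeholders, and the
> -D properties are only my guess at what might matter, e.g. lowering the
> scanner caching to reduce the load of each request):
>
>   hbase org.apache.hadoop.hbase.mapreduce.Export \
>       -D hbase.client.scanner.caching=100 \
>       -D mapred.map.tasks.speculative.execution=false \
>       my_table /backup/my_table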
>
>
> Thanks for any advice on this issue.
>
> --
> With Regards!
>
> Ye, Qian
