Hi all,

I'm using CDH3u3 (based on HBase 0.90.4 and Hadoop 0.20.2), and my cluster has about 15 servers. The data in HDFS is about 10 TB, and roughly half of it is in HBase. Custom MapReduce jobs that don't need to scan a whole HBase table run fine. However, when I back up HBase tables with the Export tool provided by HBase, one region server goes down and the backup MapReduce job fails.
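For reference, the backup job is launched with the stock Export tool, roughly one run per table like this (the HDFS output path here is just a placeholder, not my real one):

  hbase org.apache.hadoop.hbase.mapreduce.Export dailylaunchindex /backup/dailylaunchindex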
The log of the region server that went down looks like this:

2012-04-04 10:11:53,817 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback/cleanup of failed split of dailylaunchindex,2012-03-10_4e045076431fe31e74000032_d645cc647e72c5f1cc1ff3c460dcd515,1333303778356.2262c07cfc672237e61aa6113e785f55.; Failed dp13.abcd.com,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
java.io.IOException: Failed dp13.abcd.com,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:297)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:156)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:87)
Caused by: java.net.SocketTimeoutException: Call to dp7.abcd.com/10.18.10.60:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=dp7.abcd.com/10.18.10.60:60020]
    at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy9.put(Unknown Source)
    at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:122)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1392)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:375)
    at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:342)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=dp7.abcd.com/10.18.10.60:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
*2012-04-04 10:11:53,821 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=dp13.abcd.com,60020,1333436117207, load=(requests=18470, regions=244, usedHeap=6108, maxHeap=7973): Abort; we got an error after point-of-no-return*
2012-04-04 10:11:53,821 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=6015, regions=244, stores=244, storefiles=557, storefileIndexSize=414, memstoreSize=1792, compactionQueueSize=4, flushQueueSize=0, usedHeap=6156, maxHeap=7973, blockCacheSize=1335446112, blockCacheFree=336613152, blockCacheCount=20071, blockCacheHitCount=65577505, blockCacheMissCount=30264896, blockCacheEvictedCount=23463221, blockCacheHitRatio=68, blockCacheHitCachingRatio=73
2012-04-04 10:11:53,824 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Abort; we got an error after point-of-no-return
2012-04-04 10:11:53,824 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
2012-04-04 10:11:53,967 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2012-04-04 10:11:54,062 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
2012-04-04 10:11:54,837 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner -7174278054087519478
2012-04-04 10:11:54,951 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 5883825799758583233
2012-04-04 10:11:55,224 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 5800828333591092756
2012-04-04 10:11:55,261 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 5153473163996089139
2012-04-04 10:11:55,332 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 3494993576774767091
2012-04-04 10:11:55,684 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner -1265087592996306143
2012-04-04 10:11:55,849 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner -7174278054087519478
...
2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 27 on 60020: exiting
2012-04-04 10:11:55,930 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 25 on 60020: exiting
2012-04-04 10:11:55,933 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
*2012-04-04 10:11:55,933 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker inteurrpted while waiting for task, exiting*
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:205)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
    at java.lang.Thread.run(Thread.java:662)
2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 26 on 60020: exiting
2012-04-04 10:11:55,933 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker dp13.abcd.com,60020,1333436117207 exiting

My questions are:

1. Can I tune some parameters to make the export MapReduce job work? (The settings I was considering are listed in the P.S. below.)
2. Is there any other way to back up my HBase tables in this situation? I don't have a second cluster, and I cannot stop serving traffic while the backup runs.

Thanks for any advice on this issue.

--
With Regards!
Ye, Qian
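P.S. For question 1, these are the settings I was thinking of raising before the next export run, though I'm not sure which of them (if any) actually matter for this failure, or what sensible values would be:

  hbase.rpc.timeout                  (RPC timeout; the log shows a 60000 ms timeout on the call to dp7)
  hbase.regionserver.lease.period    (scanner lease period on the region servers)
  hbase.client.scanner.caching       (rows fetched per scanner next() call by the export job)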
