The log says that this region server tried to talk to the region server "dp7.abcd.com", that the call timed out after 60 seconds, and that this happened during a split, which is what makes it so bad. As the log says:
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Abort; we got an error after point-of-no-return

So what happened to that machine? I understand the logs can look opaque, but they usually give you some clue, so please investigate the logs on that machine, and please don't post them back here without analyzing them.

J-D

On Wed, Apr 4, 2012 at 8:09 PM, Qian Ye <[email protected]> wrote:
> Hi all:
>
> I'm using cdh3u3 (based on hbase-0.90.4 and hadoop-0.20.2), and my cluster
> contains about 15 servers. The size of the data in HDFS is about 10T, and
> about half of this data is in HBase. When running customized mapreduce jobs
> that do not need to scan the whole table in HBase, it's fine. However, when
> I want to back up HBase tables with the Export tool provided by HBase, one
> Region Server goes down and the backup mapreduce job fails. The logs of the
> region server look like:
>
> 2012-04-04 10:11:53,817 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback/cleanup of failed split of dailylaunchindex,2012-03-10_4e045076431fe31e74000032_d645cc647e72c5f1cc1ff3c460dcd515,1333303778356.2262c07cfc672237e61aa6113e785f55.; Failed dp13.abcd.com,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
> java.io.IOException: Failed dp13.abcd.com,60020,1333436117207-daughterOpener=54cb17a22de6a19edcbec447362b0380
> at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:297)
> at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:156)
> at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:87)
> Caused by: java.net.SocketTimeoutException: Call to dp7.abcd.com/10.18.10.60:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=dp7.abcd.com/10.18.10.60:60020]
> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> at $Proxy9.put(Unknown Source)
> at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:122)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1392)
> at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:375)
> at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:342)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.18.10.66:24672 remote=dp7.abcd.com/10.18.10.60:60020]
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> at java.io.DataInputStream.readInt(DataInputStream.java:370)
> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
> *2012-04-04 10:11:53,821 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=dp13.abcd.com,60020,1333436117207, load=(requests=18470, regions=244, usedHeap=6108, maxHeap=7973): Abort; we got an error after point-of-no-return*
> 2012-04-04 10:11:53,821 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=6015, regions=244, stores=244, storefiles=557, storefileIndexSize=414, memstoreSize=1792, compactionQueueSize=4, flushQueueSize=0, usedHeap=6156, maxHeap=7973, blockCacheSize=1335446112, blockCacheFree=336613152, blockCacheCount=20071, blockCacheHitCount=65577505, blockCacheMissCount=30264896, blockCacheEvictedCount=23463221, blockCacheHitRatio=68, blockCacheHitCachingRatio=73
> 2012-04-04 10:11:53,824 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Abort; we got an error after point-of-no-return
> 2012-04-04 10:11:53,824 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
> 2012-04-04 10:11:53,967 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> 2012-04-04 10:11:54,062 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: regionserver60020.cacheFlusher exiting
> 2012-04-04 10:11:54,837 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner -7174278054087519478
> 2012-04-04 10:11:54,951 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 5883825799758583233
> 2012-04-04 10:11:55,224 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 5800828333591092756
> 2012-04-04 10:11:55,261 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 5153473163996089139
> 2012-04-04 10:11:55,332 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner 3494993576774767091
> 2012-04-04 10:11:55,684 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner -1265087592996306143
> 2012-04-04 10:11:55,849 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Server shutting down and client tried to access missing scanner -7174278054087519478
> ...
> 2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 27 on 60020: exiting
> 2012-04-04 10:11:55,930 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
> 2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 25 on 60020: exiting
> 2012-04-04 10:11:55,933 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> *2012-04-04 10:11:55,933 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting*
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:485)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:205)
> at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
> at java.lang.Thread.run(Thread.java:662)
> 2012-04-04 10:11:55,930 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 26 on 60020: exiting
> 2012-04-04 10:11:55,933 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker dp13.abcd.com,60020,1333436117207 exiting
>
> My questions are:
>
> 1. Can I tune some parameters to make the Export mapreduce job work?
> 2. Is there any other way to back up my HBase tables in this situation? I
> don't have another cluster and I cannot stop serving when I need to back
> up the tables.
>
> Thanks for any advice on this issue.
>
> --
> With Regards!
>
> Ye, Qian
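
To illustrate question 1, here is a minimal sketch (an illustration, not something from this thread) of launching the stock Export job programmatically against HBase 0.90.x, with the client-side RPC timeout raised above the 60000 ms default that appears to correspond to the "60000 millis timeout" in the stack trace. The table name and output path are placeholders. Note that in this particular failure the timeout happened inside the region server itself while it was updating the meta table (see MetaEditor.addDaughter in the trace), so the matching server-side setting would have to go into hbase-site.xml on the servers rather than into the job.

    // Minimal sketch, assuming HBase 0.90.x (cdh3u3) APIs.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.Export;
    import org.apache.hadoop.mapreduce.Job;

    public class ExportBackup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Raise the client-side RPC timeout above the 60000 ms default (illustrative value).
        conf.setInt("hbase.rpc.timeout", 120000);
        // Same arguments the command-line tool takes: <tablename> <outputdir> [<versions> ...]
        Job job = Export.createSubmittableJob(conf, new String[] {
            "dailylaunchindex",         // table name taken from the log above
            "/backup/dailylaunchindex"  // placeholder HDFS output directory
        });
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same job is normally started from the command line as "hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir>"; the programmatic form above only makes the configuration override explicit.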
