For why the Master died, this looks like https://issues.apache.org/jira/browse/HBASE-3617
The region server died on OutOfMemoryError: Java heap space, and I can see that it's at 4GB on 4GB so you must be inserting really large values from a lot of clients at the same time. Is that right? J-D On Mon, Mar 14, 2011 at 6:43 AM, 茅旭峰 <[email protected]> wrote: > Hi, > > We are trying a small hbase environment, including 2-node masters, 4-node > regionservers, and 3-node zookeeper cluster. > Now we're doing stress tests, using cells of bigger size(4mb - 15mb). What > we run into was, > > 1. one of the regionserver(10.241.67.33) aborted due to OOM, with log > ==== > 2011-03-14 20:35:28,235 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Starting compaction on region > table2,SXGdRIcvExFco4zZIPwhcKjjYlo=,1300096043004.8b00c3cb40e3e827f46aaaeeafce044d. > 2011-03-14 20:35:28,236 INFO > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up > with memory above low water. > 2011-03-14 20:35:28,236 ERROR > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher failed > for entry > org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@2ca74909 > > ...... > > 2011-03-14 20:36:40,396 INFO > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up > with memory above low water. > 2011-03-14 20:36:40,396 ERROR > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher failed > for entry > org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@3b97477e > ...... > 2011-03-14 20:41:36,541 INFO org.apache.hadoop.hbase.regionserver.HRegion: > aborted compaction on region > table2,Ig92Q4M9wK3GagLmUMDhHvlQBr0=,1300097018887.62a5a22f09f37e099679b2f6a12490b6. > after 1mins, 39sec > 2011-03-14 20:41:36,541 WARN > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for > region we are not serving; 944c234d8f64eccd0833d86349623309 > 2011-03-14 20:41:36,544 FATAL > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=255, > usedHeap=3982, maxHeap=3983): Uncaught exception in service thread > regionserver60020.compactor > java.lang.OutOfMemoryError: Java heap space > at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:312) > at > org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1093) > at > org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036) > at > org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276) > at > org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87) > at > org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82) > at > org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262) > at > org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326) > at > org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927) > at > org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733) > at > org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769) > at > org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714) > at > org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81) > ...... > 2011-03-14 20:41:38,067 INFO org.apache.hadoop.hbase.regionserver.Leases: > regionserver60020 closing leases > 2011-03-14 20:41:38,067 INFO org.apache.hadoop.hbase.regionserver.Leases: > regionserver60020 closed leases > 2011-03-14 20:41:38,085 INFO > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: > Closed zookeeper sessionid=0x32e9a2f2678000a > 2011-03-14 20:41:38,093 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x32e9a2f2678000a closed > 2011-03-14 20:41:38,093 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2011-03-14 20:41:38,100 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x12e9a2e21be0011 closed > 2011-03-14 20:41:38,101 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2011-03-14 20:41:38,102 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 > exiting > 2011-03-14 20:41:38,117 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; > hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main] > 2011-03-14 20:41:38,117 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook > 2011-03-14 20:41:38,117 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook > thread. > 2011-03-14 20:41:38,220 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished. > > ==== > > 2.the master shutdown itself, with log > > ==== > 2011-03-14 20:41:03,991 DEBUG > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: > Cache hit for row <> in tableName .META.: location server cloud134:60020, > location region name .META.,,1.1028785192 > 2011-03-14 20:41:35,415 FATAL org.apache.hadoop.hbase.master.HMaster: Remote > unexpected exception > java.io.IOException: Call to /10.241.67.33:60020 failed on local exception: > java.io.EOFException > at > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:806) > at > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) > at $Proxy6.closeRegion(Unknown Source) > at > org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:589) > at > org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1093) > at > org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1672) > at org.apache.hadoop.hbase.Chore.run(Chore.java:66) > Caused by: java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:375) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477) > 2011-03-14 20:41:35,417 INFO org.apache.hadoop.hbase.master.HMaster: > Aborting > 2011-03-14 20:41:35,421 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0), message=Connection refused, > region=1bd40fb0466d9b0b52f93f9af41b3a8e > 2011-03-14 20:41:35,424 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0), message=Connection refused, > region=1c564f7c72355b78369aa1467d9852e5 > 2011-03-14 20:41:35,425 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0), message=Connection refused, > region=1cabffd7a9213938ac1181f05c19f2d7 > ..... > 2011-03-14 20:41:36,013 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0), message=Connection refused, > region=f694004eeb2dea68a65ec3b378764805 > 2011-03-14 20:41:36,017 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0), message=Connection refused, > region=fa3f2e8c175d794aa8b18e3e8995d382 > 2011-03-14 20:41:36,025 INFO > org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor: > cloud135:60000.timeoutMonitor exiting > 2011-03-14 20:41:36,093 DEBUG org.apache.hadoop.hbase.master.HMaster: > Stopping service threads > 2011-03-14 20:41:36,093 INFO org.apache.hadoop.ipc.HBaseServer: Stopping > server on 60000 > 2011-03-14 20:41:36,093 INFO org.apache.hadoop.hbase.master.CatalogJanitor: > cloud135:60000-CatalogJanitor exiting > 2011-03-14 20:41:36,094 INFO org.apache.hadoop.hbase.master.HMaster$1: > cloud135:60000-BalancerChore exiting > 2011-03-14 20:41:36,094 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC > Server listener on 60000 > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 0 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 4 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC > Server Responder > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 3 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 1 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 8 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 7 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 5 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.hbase.master.HMaster: > Stopping infoServer > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 2 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 6 on 60000: exiting > 2011-03-14 20:41:36,096 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server > handler 9 on 60000: exiting > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.hbase.master.LogCleaner: > master-cloud135:60000.oldLogCleaner exiting > 2011-03-14 20:41:36,096 INFO org.mortbay.log: Stopped > [email protected]:60010 > 2011-03-14 20:41:36,222 DEBUG > org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog tracker > org.apache.hadoop.hbase.catalog.CatalogTracker@62770d2e > 2011-03-14 20:41:36,323 INFO > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: > Closed zookeeper sessionid=0x12e9a2e21be0038 > 2011-03-14 20:41:36,328 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2011-03-14 20:41:36,328 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x12e9a2e21be0038 closed > 2011-03-14 20:41:36,340 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2011-03-14 20:41:36,340 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x22e9a2e22390033 closed > 2011-03-14 20:41:36,340 INFO org.apache.hadoop.hbase.master.HMaster: HMaster > main thread exiting > > ==== > > the died regionserver is cloud140(10.241.67.33). What confusing us is why > the death of regionserver would cause the master shut down itself. > On the other hand, the standby master started, but we still could not > manipulate the table because too many WALs to recover. > > I guess it relates to the previous thread 'HMaster startup is very slow, and > always run into out-of-memory issue' I submited. > > I guess both issues are about gc tuning, but I can only get guidelines like > use CMS policy, any others? maybe particularly for > bigger cell. > > Thanks and regards, > > Mao Xu-Feng >
