On Tue, Aug 10, 2010 at 3:40 PM, Stack <[email protected]> wrote:
> OOME may manifest in one place but be caused by some other behavior
> altogether. It's an Error. You can't tell for sure what damage it's
> done to the running process (though, in your stack trace, an OOME
> during the array copy could likely be because of very large cells).
> Rather than let the damaged server continue, HBase is conservative and
> shuts itself down to minimize possible data loss whenever it gets an
> OOME (it has kept aside an emergency memory supply that it releases on
> OOME so the shutdown can 'complete' successfully).
>

I understand from the above that HBase shuts the service down to protect
data. But can't HBase avoid the OOME in the first place? Or is the OOME
situation an open bug in HBase?
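
(Aside, since this question keeps coming up: the "emergency memory supply" Stack mentions is a general defensive pattern -- pre-allocate a block of heap at startup and drop the reference when an OOME is caught, so the logging/abort path still has memory to run in. The sketch below only illustrates that pattern; it is not HBase's actual code, and the class name and sizes are made up.)

// Illustration only: the general "emergency memory reserve" pattern, not HBase code.
public class EmergencyReserveExample {

  // Reserve ~5MB up front, while the heap still has room.
  private static byte[] emergencyReserve = new byte[5 * 1024 * 1024];

  public static void main(String[] args) {
    try {
      doWork();
    } catch (OutOfMemoryError e) {
      // Drop the reserve so the code below can still allocate (log buffers, etc.).
      emergencyReserve = null;
      System.err.println("OutOfMemoryError, aborting: " + e);
      // A real server would now try to shut down cleanly (stop taking requests,
      // flush what it safely can, notify the master) rather than keep serving,
      // because the process state after an Error can no longer be trusted.
      System.exit(1);
    }
  }

  private static void doWork() {
    // Simulate memory pressure, e.g. buffering a very large cell value.
    java.util.List<byte[]> buffers = new java.util.ArrayList<byte[]>();
    while (true) {
      buffers.add(new byte[64 * 1024 * 1024]);
    }
  }
}
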
It sounds like HBase can hit an OOME whenever it is under heavy load -- I
recall several people reporting OOMEs for unknown reasons.

> Are you doing large multiputs? Do you have lots of handlers running?
> If the multiputs are held up because things are running slow, memory
> used out on the handlers could throw you over, especially if your heap
> is small.
>
> What size heap are you running with?
>

By the way, can someone talk about the optimal heap size? Say I have 16GB
in my box, and I use 2GB for my DataNode/TaskTracker etc. Presumably I'd
like to set my RS heap size >= 12GB to cache as much data in memory as
possible. But I've heard people say that too large a heap will cause GC
pause issues. Can someone give a detailed analysis of what I should do?

Thanks,
Sean

> St.Ack
>
>
> On Tue, Aug 10, 2010 at 3:26 PM, Stuart Smith <[email protected]> wrote:
> > Hello,
> >
> > I'm seeing errors like so:
> >
> > 2010-08-10 12:58:38,938 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Disconnected, type: None, path: null
> > 2010-08-10 12:58:38,939 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> >
> > 2010-08-10 12:58:38,941 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> > java.lang.OutOfMemoryError: Java heap space
> >     at java.util.Arrays.copyOf(Arrays.java:2786)
> >     at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:942)
> >
> > Then I see:
> >
> > 2010-08-10 12:58:39,408 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 79 on 60020, call close(-2793534857581898004) from 192.168.195.88:41233: error: java.io.IOException: Server not running, aborting
> > java.io.IOException: Server not running, aborting
> >
> > And finally:
> >
> > 2010-08-10 12:58:39,514 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stop requested, clearing toDo despite exception
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> >
> > And the server begins to shut down.
> >
> > Now, it's very likely these are due to retrieving unusually large cells - in fact, that's my current assumption. I'm seeing M/R tasks fail intermittently with the same issue on the read of cell data.
> >
> > My question is: why does this bring the whole regionserver down? I would think the regionserver would just fail the Get() and move on...
> >
> > Am I misdiagnosing the error? Or is it the case that if I want different behavior, I should pony up with some code? :)
> >
> > Take care,
> > -stu
> >
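
(Aside on Stuart's large-cell theory: while it is being confirmed, one client-side mitigation is to keep each scanner RPC small, so a handler thread never has to build a huge response in one ByteArrayOutputStream. Below is a rough sketch against the 0.20.x-era client API -- the table name "filestore" is made up, method names may differ slightly in your version, and note that a single enormous cell still comes back in one piece, so the durable fix is capping cell sizes at write time.)

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallRpcScan {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration();  // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "filestore");        // "filestore" is a hypothetical table name

    Scan scan = new Scan();
    scan.setCaching(1);  // one row per next() RPC instead of a large block of rows
    scan.setBatch(10);   // at most 10 KeyValues per Result, so wide rows arrive in pieces

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // With setBatch(), the same row key can appear in consecutive Results.
        System.out.println(Bytes.toString(r.getRow()) + ": " + r.size() + " keyvalues");
      }
    } finally {
      scanner.close();
    }
  }
}
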
