OOME may manifest in one place but be caused by some other behavior altogether. It's an Error: you can't tell for sure what damage it's done to the running process (though, in your stack trace, an OOME during the array copy could well be caused by very large cells). Rather than let a damaged server continue, HBase is conservative and shuts itself down to minimize possible data loss whenever it gets an OOME (it keeps aside an emergency memory supply that it releases on OOME so the shutdown can 'complete' successfully).
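The "emergency memory supply" idea can be sketched like this. This is a minimal illustration of the pattern, not HBase's actual implementation: grab a reserve buffer at startup, and when an OutOfMemoryError is caught, drop the reserve so the abort/shutdown path has headroom to run. The class and method names are hypothetical.

```java
// Sketch of the "memory parachute" pattern (illustrative, not HBase code).
public class OomeParachute {
    // Reserve allocated up front; released only when an OOME is caught.
    private byte[] reserve = new byte[5 * 1024 * 1024];
    private boolean aborted = false;

    /** Runs a task; on OOME, frees the reserve and aborts the server. */
    public boolean run(Runnable task) {
        try {
            task.run();
            return true;
        } catch (OutOfMemoryError oome) {
            reserve = null;  // free the parachute so cleanup can still allocate
            abort();         // shutdown now has room to complete
            return false;
        }
    }

    private void abort() {
        aborted = true;      // stand-in for the real abort/shutdown sequence
    }

    public boolean isAborted() {
        return aborted;
    }

    public static void main(String[] args) {
        OomeParachute server = new OomeParachute();
        // Simulate a request that blows the heap: the requested array is far
        // larger than any default heap, so this throws OutOfMemoryError.
        boolean ok = server.run(() -> {
            long[] tooBig = new long[Integer.MAX_VALUE];
            tooBig[0] = 1;
        });
        System.out.println("ok=" + ok + " aborted=" + server.isAborted());
    }
}
```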
Are you doing large multiputs? Do you have lots of handlers running? If the multiputs are held up because things are running slow, the memory held out on the handlers could push you over, especially if your heap is small. What size heap are you running with?

St.Ack

On Tue, Aug 10, 2010 at 3:26 PM, Stuart Smith <[email protected]> wrote:
> Hello,
>
> I'm seeing errors like so:
>
> 2010-08-10 12:58:38,938 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Disconnected, type: None, path: null
> 2010-08-10 12:58:38,939 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
>
> 2010-08-10 12:58:38,941 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2786)
>     at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:942)
>
> Then I see:
>
> 2010-08-10 12:58:39,408 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 79 on 60020, call close(-2793534857581898004) from 192.168.195.88:41233: error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>
> And finally:
>
> 2010-08-10 12:58:39,514 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stop requested, clearing toDo despite exception
> 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
>
> And the server begins to shut down.
>
> Now, it's very likely these are due to retrieving unusually large cells - in fact, that's my current assumption. I'm seeing M/R tasks fail intermittently with the same issue on the read of cell data.
>
> My question is: why does this bring the whole regionserver down? I would think the regionserver would just fail the Get() and move on...
>
> Am I misdiagnosing the error? Or is it the case that if I want different behavior, I should pony up with some code? :)
>
> Take care,
> -stu
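On the "large multiputs" question above: the client-side mitigation is to send smaller batches so each RPC (and the server-side handler buffering it) holds less data at once. A hedged sketch of that chunking, with a generic helper standing in for the real client calls (with HBase, each batch would go to something like HTable.put(List<Put>); the names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative batching helper: split one huge list of puts into
// consecutive smaller batches to bound per-RPC memory.
public class PutBatcher {
    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> chunk(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> puts = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            puts.add(i);
        }
        // In real code, each batch would be one table.put(batch) call.
        for (List<Integer> batch : chunk(puts, 3)) {
            System.out.println("would send batch of " + batch.size());
        }
    }
}
```

The trade-off is more round trips for less peak memory on both sides; a batch size tuned to cell size matters more than the count when cells are unusually large.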
