Hello Stack,

> Rather than let the damaged server continue, HBase is
> conservative
Ah. I see. That does make sense.

> Are you doing large multiputs? Do you have lots of
> handlers running?

Actually this was a M/R task doing lots of reads. But I do have
automation (a standalone Java HBase client, not M/R) that runs every
hour doing lots of puts. I think the two could have overlapped and
caused issues.

> What size heap are you running with?

HBase has 4 GB; Hadoop had 2 GB (on the regionserver/datanode/tasktracker
machines).

What I actually ended up doing was catching the OOMEs in my M/R tasks
and looking at the cell size. One of the cells was 500 MB :|. So that
was bad. I've taken to avoiding large cells in the M/R task, and things
have smoothed out. It looks like I should just be a little more
circumspect with how much data I cram into a cell. Mostly I limit cells
to 64 MB, but for one particular task I limited them to 512 MB, and I'm
getting a decent amount of data now, so inevitably I hit the limit...

Thanks!

Take care,
  -stu

--- On Tue, 8/10/10, Stack <[email protected]> wrote:

> From: Stack <[email protected]>
> Subject: Re: Avoiding OutOfMemory Java heap space in region servers
> To: [email protected]
> Date: Tuesday, August 10, 2010, 6:40 PM
>
> OOME may manifest in one place but be caused by some other behavior
> altogether. It's an Error. You can't tell for sure what damage it's
> done to the running process (though, in your stack trace, an OOME
> during the array copy could likely be because of very large cells).
> Rather than let the damaged server continue, HBase is conservative and
> shuts itself down to minimize possible data loss whenever it gets an
> OOME (it has kept aside an emergency memory supply that it releases on
> OOME so the shutdown can 'complete' successfully).
>
> Are you doing large multiputs? Do you have lots of handlers running?
> If the multiputs are held up because things are running slow, memory
> used out on the handlers could throw you over, especially if your heap
> is small.
> What size heap are you running with?
>
> St.Ack
>
> On Tue, Aug 10, 2010 at 3:26 PM, Stuart Smith <[email protected]> wrote:
> > Hello,
> >
> > I'm seeing errors like so:
> >
> > 2010-08-10 12:58:38,938 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Disconnected, type: None, path: null
> > 2010-08-10 12:58:38,939 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> >
> > 2010-08-10 12:58:38,941 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> > java.lang.OutOfMemoryError: Java heap space
> >     at java.util.Arrays.copyOf(Arrays.java:2786)
> >     at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:942)
> >
> > Then I see:
> >
> > 2010-08-10 12:58:39,408 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 79 on 60020, call close(-2793534857581898004) from 192.168.195.88:41233: error: java.io.IOException: Server not running, aborting
> > java.io.IOException: Server not running, aborting
> >
> > And finally:
> >
> > 2010-08-10 12:58:39,514 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stop requested, clearing toDo despite exception
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> >
> > And the server begins to shut down.
> >
> > Now, it's very likely these are due to retrieving unusually large
> > cells - in fact, that's my current assumption. I'm seeing M/R tasks
> > fail intermittently with the same issue on the read of cell data.
> >
> > My question is: why does this bring the whole regionserver down? I
> > would think the regionserver would just fail the Get() and move on...
> >
> > Am I misdiagnosing the error? Or is it the case that if I want
> > different behavior, I should pony up with some code? :)
> >
> > Take care,
> >  -stu
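[Editor's note: the client-side cell-size cap discussed in the reply (64 MB for most tables, 512 MB for one) can be sketched as a small guard run before building each Put. This is a minimal illustration, not part of the HBase API; the class and method names are hypothetical.]

```java
// Hypothetical client-side guard: reject a value that exceeds a
// per-cell byte cap before it is ever sent to a region server, so an
// oversized cell fails fast in the client instead of stressing a
// server-side handler.
public class CellSizeGuard {

    // 64 MB, matching the cap the thread settles on for most tables.
    static final long DEFAULT_MAX_CELL_BYTES = 64L * 1024 * 1024;

    // Returns true when the value is non-null and within the cap.
    static boolean fitsInCell(byte[] value, long maxBytes) {
        return value != null && (long) value.length <= maxBytes;
    }

    public static void main(String[] args) {
        // Tiny cap (10 bytes) purely for demonstration.
        byte[] small = new byte[8];
        byte[] tooBig = new byte[16];
        System.out.println(fitsInCell(small, 10));   // true
        System.out.println(fitsInCell(tooBig, 10));  // false
    }
}
```

A caller would check `fitsInCell` (with `DEFAULT_MAX_CELL_BYTES`) before `Put.add(...)` and either split the value across cells or store it outside HBase when the check fails.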
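[Editor's note: the diagnostic described in the reply, catching the OOME inside the M/R task and inspecting the cell size, might look roughly like the sketch below. `fetchCell` is a stand-in for a real HBase read such as `Result.getValue`; everything here is illustrative, not the poster's actual code.]

```java
// Sketch of catching OutOfMemoryError around a single cell read so a
// task can log the offending cell size and continue, rather than die.
public class OomeDiagnostic {

    // Stand-in for an HBase read; allocating the array mimics
    // materializing a cell value of the given size in the client heap.
    static byte[] fetchCell(int size) {
        return new byte[size];
    }

    static String readWithDiagnostic(int cellSize) {
        try {
            byte[] cell = fetchCell(cellSize);
            return "read " + cell.length + " bytes";
        } catch (OutOfMemoryError e) {
            // Catching Error is normally unsafe (the JVM state may be
            // damaged), but per-cell it lets the task record how large
            // the cell was and skip it instead of failing outright.
            return "skipped oversized cell of ~" + cellSize + " bytes";
        }
    }

    public static void main(String[] args) {
        System.out.println(readWithDiagnostic(8)); // read 8 bytes
        // A huge request typically trips OutOfMemoryError on HotSpot
        // and takes the skip path instead of killing the task.
        System.out.println(readWithDiagnostic(Integer.MAX_VALUE));
    }
}
```

Note this only protects the client/task side; as Stack explains above, the region server itself deliberately aborts on OOME rather than attempt recovery.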
