On Tue, Aug 10, 2010 at 3:40 PM, Stack <[email protected]> wrote:

> OOME may manifest in one place but be caused by some other behavior
> altogether.  It's an Error: you can't tell for sure what damage it has
> done to the running process (though, in your stack trace, the OOME
> during the array copy was likely caused by very large cells).
> Rather than let a damaged server continue, HBase is conservative and
> shuts itself down to minimize possible data loss whenever it gets an
> OOME (it keeps aside an emergency memory supply that it releases on
> OOME so the shutdown can 'complete' successfully).
>
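For reference, the "emergency memory supply" trick described above works
roughly like the sketch below; this is an illustration of the shape of
it, not HBase's actual code, and the reserve size is a guess:

    // A reserve buffer is allocated up front and dropped when an OOME is
    // caught, freeing enough heap for the shutdown path to run.
    public class OomeGuard {
      private byte[] reserve = new byte[5 * 1024 * 1024]; // size is a guess

      public void serve() {
        try {
          serviceRequests(); // hypothetical main request loop
        } catch (OutOfMemoryError e) {
          reserve = null; // release the reserve so the abort path can allocate
          abort("OutOfMemoryError, aborting", e);
        }
      }

      private void serviceRequests() { /* handle RPCs */ }

      private void abort(String msg, Throwable t) { /* log, stop threads */ }
    }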
I understand from the above that HBase shuts down the service to protect
the data. But can't HBase avoid the OOME in the first place? Or is the
OOME situation a known, still-open bug in HBase?

It sounds like HBase can OOME whenever it is under heavy load -- I recall
several people reporting OOMEs for unknown reasons.

>
> Are you doing large multiputs?  Do you have lots of handlers running?
> If the multiputs are held up because things are running slow, the
> memory held by the in-flight handlers could push you over, especially
> if your heap is small.
>
> What size heap are you running with?
>
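For illustration, one way to keep individual multiputs small is to cap
the client-side write buffer so each RPC the handlers must hold stays
modest. A minimal sketch against the old HTable client API (the table
name and sizes are made-up examples):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BoundedPuts {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");
        table.setAutoFlush(false);                 // buffer puts client-side
        table.setWriteBufferSize(2 * 1024 * 1024); // flush every ~2MB so RPCs stay small
        for (int i = 0; i < 100000; i++) {
          Put p = new Put(Bytes.toBytes("row-" + i));
          p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
          table.put(p); // queued locally; sent when the buffer fills
        }
        table.flushCommits(); // push any remainder
        table.close();
      }
    }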

By the way, can someone talk about the optimal heap size? Say I have 16GB
in my box, and I use 2GB for my DataNode/TaskTracker etc. Presumably I'd
want to set my RegionServer heap to >=12GB to cache as much data in
memory as possible, but I have heard people say that too large a heap
causes GC pause issues.

Can someone give a detailed analysis of what I should do?
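For concreteness, I assume the knobs in question are the ones in
hbase-env.sh; the numbers below are placeholders, not a recommendation:

    # hbase-env.sh -- placeholder values only.
    # A moderate heap with CMS tends to keep pauses short; very large
    # heaps risk long stop-the-world full collections.
    export HBASE_HEAPSIZE=8000
    export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"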

Thanks,
Sean


>
> St.Ack
>
>
>
> On Tue, Aug 10, 2010 at 3:26 PM, Stuart Smith <[email protected]> wrote:
> > Hello,
> >
> >   I'm seeing errors like so:
> >
> > 2010-08-10 12:58:38,938 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Disconnected, type: None, path: null
> > 2010-08-10 12:58:38,939 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> >
> > 2010-08-10 12:58:38,941 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> > java.lang.OutOfMemoryError: Java heap space
> >        at java.util.Arrays.copyOf(Arrays.java:2786)
> >        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:942)
> >
> > Then I see:
> >
> > 2010-08-10 12:58:39,408 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 79 on 60020, call close(-2793534857581898004) from 192.168.195.88:41233: error: java.io.IOException: Server not running, aborting
> > java.io.IOException: Server not running, aborting
> >
> > And finally:
> >
> > 2010-08-10 12:58:39,514 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stop requested, clearing toDo despite exception
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> >
> > And the server begins to shut down.
> >
> > Now, it's very likely these are due to retrieving unusually large
> > cells -- in fact, that's my current assumption. I'm seeing M/R tasks
> > fail intermittently with the same issue on the read of cell data.
> >
> > My question is: why does this bring the whole regionserver down? I
> > would think the regionserver would just fail the Get() and move on...
> >
> > Am I misdiagnosing the error? Or is it the case that if I want
> > different behavior, I should pony up with some code? :)
> >
> > Take care,
> >  -stu
>
