On Sat, Aug 14, 2010 at 1:26 AM, Sean Bigdatafun
<[email protected]> wrote:
> On Tue, Aug 10, 2010 at 3:40 PM, Stack <[email protected]> wrote:
>
>> OOME may manifest in one place but be caused by some other behavior
>> altogether.  It's an Error.  You can't tell for sure what damage it's
>> done to the running process (though, in your stack trace, an OOME
>> during the array copy is likely because of very large cells).
>> Rather than let the damaged server continue, HBase is conservative and
>> shuts itself down to minimize possible data loss whenever it gets an
>> OOME (it has kept aside an emergency memory supply that it releases on
>> OOME so the shutdown can 'complete' successfully).
>>
> I understand from the above that HBase shuts down the service for data
> protection. But can't HBase avoid the OOME in the first place? Or is the
> OOME situation a pending bug in HBase?
>
> It sounds like HBase can hit an OOME whenever it is under heavy load -- I
> recall several people reporting OOMEs for unknown reasons.
>

There is always a reason for an OOME.

In our experience, the only remaining cause of OOME in HBase is clients
trying to load many-megabyte cells concurrently, or clients using large
write buffers so that big payloads are passed to the server in each RPC
request.  Our RPC is not streaming; it passes byte arrays.  If there are
lots of handlers in the server and all of them are being handed big
payloads, then it's possible that at that moment the server heap is
overwhelmed.
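
For example, here is a rough, untested sketch of the client-side
write-buffer knob (the table name, column family, and 1MB figure are
made up for illustration; the default hbase.client.write.buffer is 2MB):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SmallWriteBufferExample {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      // Default hbase.client.write.buffer is 2MB; a smaller buffer means
      // each flush hands a smaller byte payload to a server handler.
      conf.setLong("hbase.client.write.buffer", 1 * 1024 * 1024);

      HTable table = new HTable(conf, "mytable");
      table.setAutoFlush(false);                  // buffer puts client-side
      table.setWriteBufferSize(1 * 1024 * 1024);  // same knob, per-table

      Put p = new Put(Bytes.toBytes("row1"));
      p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
      table.put(p);
      table.flushCommits();                       // ships the buffered edits
      table.close();
    }
  }

With autoflush off, nothing goes to the server until the buffer fills or
flushCommits() is called, so the buffer size roughly bounds the payload
each RPC carries into a handler.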

Is this your case?

If you need help diagnosing, let us help.  When HBase OOMEs, it dumps
the heap.  Put the dump somewhere we can pull it from.

The server keeps account of heap used, except here at the edge where
RPC is taking in requests.

The fix is a little awkward but we'll get to it.  Meantime, the
workarounds are: up the server heap, cut the number of handlers, use a
smaller client write buffer, or don't try loading cells of more than
10MB or so -- write those to HDFS directly and keep just the location
in HBase (HBase is not suited to carrying large stuff in cells).
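
As a rough illustration of the 'HDFS direct' workaround (untested
sketch; the table name, column family, and path are made up), the idea
is to write the big payload to HDFS and keep only a small pointer cell
in HBase:

  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BlobOutOfBandExample {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      byte[] blob = new byte[20 * 1024 * 1024];   // stand-in for a big cell

      // Write the big payload straight to HDFS...
      FileSystem fs = FileSystem.get(conf);
      Path blobPath = new Path("/blobs/row1");
      FSDataOutputStream out = fs.create(blobPath);
      out.write(blob);
      out.close();

      // ...and store only the (tiny) location in HBase.
      HTable table = new HTable(conf, "mytable");
      Put p = new Put(Bytes.toBytes("row1"));
      p.add(Bytes.toBytes("meta"), Bytes.toBytes("blobpath"),
            Bytes.toBytes(blobPath.toString()));
      table.put(p);
      table.flushCommits();
      table.close();
    }
  }

Readers then get the path from HBase and open the file from HDFS, so
only the small pointer ever travels through the regionserver's RPC.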

Who are the 'several people' reporting OOMEs?  I saw Ted Yu talking of
an OOME this week.  There is evidence of large cells in his case, so
the hypothesis outlined above would seem to hold for him.


>>
>> Are you doing large multiputs?  Do you have lots of handlers running?
>> If the multiputs are held up because things are running slow, the
>> memory tied up in the handlers could push you over, especially if
>> your heap is small.
>>
>> What size heap are you running with?
>>
>

You didn't answer my questions above.


> By the way, can someone talk about the optimal heap size? Say I have 16GB
> in my box, and I use 2GB for my DataNode/TaskTracker etc. Presumably I'd
> like to set my RS heap size >= 12GB to cache as much data in memory as
> possible. But I have heard people say that too large a heap will cause GC
> pause issues.
>

4-8G is what fellas normally run with.

> Can someone give a detailed analysis of what I should do?
>

What do you need beyond the above?
St.Ack
