This is my config:
http://pastebin.com/ghuztyrS


Note that my heap is 5GB, with 0.1 for both the upper and lower
memstore limits, and the flush size set to half of the default,
~30MB.  I want to keep less in memory and flush more often.  My
region files are huge, 1GB now, but will grow to 2 and 3 GB (I am
adjusting them).  The reason is that I want to keep under 1000
regions per RS, and my access pattern is very concentrated on the
top 5% most recent photo uploads.  So I can store files deep and not
really care about hitting them hard; also, we run Varnish to cache
most of the files on the frontends, so HBase here is really for file
store integrity.  Still very curious how that one region got to so
many files.  Perhaps there should be a provision to compact more
aggressively?
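
In hbase-site.xml terms, the settings described above would look
roughly like the following (values approximated from the description
rather than copied from the pastebin):

 <property>
   <name>hbase.regionserver.global.memstore.upperLimit</name>
   <value>0.1</value>
 </property>
 <property>
   <name>hbase.regionserver.global.memstore.lowerLimit</name>
   <value>0.1</value>
 </property>
 <property>
   <name>hbase.hregion.memstore.flush.size</name>
   <!-- ~32MB, roughly half the 64MB default -->
   <value>33554432</value>
 </property>
 <property>
   <name>hbase.hregion.max.filesize</name>
   <!-- 1GB region files for now, to be raised to 2-3GB -->
   <value>1073741824</value>
 </property>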

-Jack


On Mon, Sep 27, 2010 at 2:24 PM, Jean-Daniel Cryans <[email protected]> wrote:
> The short answer is: because the oldest HLogs still contain edits
> from regions that haven't flushed yet.  532 regions is part of the
> reason, and I guess you are doing some importing, so the updates
> must be spread across a lot of them.
>
> But let's run some math.  32 HLogs of ~64MB each is about 2GB,
> whereas each region flushes when it reaches 64MB, so since you have
> 532 regions and your loading pattern is presumably a bit random, it
> would take ~33GB of RAM to hold everything before it all starts
> flushing.  Also, there's a global memstore max size of 40% (the
> default), so since you gave it 5000MB of heap, you cannot have more
> than 2000MB of data across all the memstores in each region server.
> This is actually great, because 32 HLogs together add up to about
> that same size, but where everything gets screwed up is the total
> number of regions getting loaded.
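>
> (For reference, the two limits that math runs into are these, shown
> here with their stock defaults; a sketch only:)
>
>  <property>
>    <name>hbase.regionserver.maxlogs</name>
>    <!-- default: once more than 32 HLogs accumulate, the regions
>         holding the oldest edits are force-flushed -->
>    <value>32</value>
>  </property>
>  <property>
>    <name>hbase.regionserver.global.memstore.upperLimit</name>
>    <!-- default: memstores may use at most 40% of the heap -->
>    <value>0.4</value>
>  </property>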
>
> So you can set the max number of HLogs higher, but you still have
> the same amount of memory, so you'll run into the global memstore
> limit instead of the max HLog count, which still has the effect of
> force-flushing small regions (which triggers compactions, and
> everything becomes far less efficient than it was designed to be).
> I still cannot explain how you ended up with 934 store files to
> compact, but you should definitely take great care to get the number
> of regions per region server down to a more manageable level.  Did
> you play with MAX_FILESIZE on that table?
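>
> (If you do raise them, it would be something like this; example
> values only, tune to your heap and table sizes:)
>
>  <property>
>    <name>hbase.regionserver.maxlogs</name>
>    <!-- example only: allow more HLogs before forced flushes -->
>    <value>64</value>
>  </property>
>  <property>
>    <name>hbase.hregion.max.filesize</name>
>    <!-- example only: 2GB regions to bring the region count down;
>         a per-table MAX_FILESIZE attribute overrides this default -->
>    <value>2147483648</value>
>  </property>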
>
> J-D
>
> On Mon, Sep 27, 2010 at 1:46 PM, Jack Levin <[email protected]> wrote:
>> http://pastebin.com/S7ETUpSb
>>
>> and
>>
>> Too many HLog files:
>>
>> http://pastebin.com/j3GMynww
>>
>> Why do I have so many hlogs?
>>
>> -Jack
>>
>>
>> On Mon, Sep 27, 2010 at 1:33 PM, Jean-Daniel Cryans <[email protected]> 
>> wrote:
>>> You could set the blocking store files setting higher (we have it at
>>> 17 here), but looking at the log I see it was blocking for 90 seconds
>>> only to flush a 1MB file.  Why was that flush requested?  Was the
>>> global memstore size reached?  The log from a few lines before should
>>> tell you.
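>>>
>>> (A rough sketch of what raising it looks like; 17 is just the value
>>> we happen to run with, and 90000ms is the wait mentioned above:)
>>>
>>>  <property>
>>>    <name>hbase.hstore.blockingStoreFiles</name>
>>>    <value>17</value>
>>>  </property>
>>>  <property>
>>>    <name>hbase.hstore.blockingWaitTime</name>
>>>    <!-- the 90-second flush wait, in milliseconds -->
>>>    <value>90000</value>
>>>  </property>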
>>>
>>> J-D
>>>
>>> On Mon, Sep 27, 2010 at 1:18 PM, Jack Levin <[email protected]> wrote:
>>>> I see it:  http://pastebin.com/tgQHBSLj
>>>>
>>>> Interesting situation indeed.  Any thoughts on how to avoid it?
>>>> Should compactions run more aggressively?
>>>>
>>>> -Jack
>>>>
>>>> On Mon, Sep 27, 2010 at 1:00 PM, Jean-Daniel Cryans <[email protected]> 
>>>> wrote:
>>>>> Can you grep around the region server log files to see what was going
>>>>> on with that region during the previous run?  There's only one way I
>>>>> see this happening: your region server would have to be serving
>>>>> thousands of regions, this region would have to be queued for
>>>>> compaction behind all those thousands of regions, and in the meantime
>>>>> the 90-second flush blocker would have to time out enough times to
>>>>> leave you with all those store files.  By my quick calculation (934
>>>>> store files times the 90-second wait), that would be about 23 hours
>>>>> before the region server was able to compact that region, which is
>>>>> something I've never seen, and it would have killed your region
>>>>> server with an OOME.  Do you see this message often?
>>>>>
>>>>>       LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
>>>>>          "ms on a compaction to clean up 'too many store files'; waited " +
>>>>>          "long enough... proceeding with flush of " +
>>>>>          region.getRegionNameAsString());
>>>>>
>>>>> Thx,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Sep 27, 2010 at 12:54 PM, Jack Levin <[email protected]> wrote:
>>>>>> Strange: this is what I have:
>>>>>>
>>>>>>  <property>
>>>>>>    <name>hbase.hstore.blockingStoreFiles</name>
>>>>>>    <value>7</value>
>>>>>>    <description>
>>>>>>    If more than this number of StoreFiles in any one Store
>>>>>>    (one StoreFile is written per flush of MemStore) then updates are
>>>>>>    blocked for this HRegion until a compaction is completed, or
>>>>>>    until hbase.hstore.blockingWaitTime has been exceeded.
>>>>>>    </description>
>>>>>>  </property>
>>>>>>
>>>>>> I wonder how it got there; I've deleted the files.
>>>>>>
>>>>>> -jack
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 27, 2010 at 12:42 PM, Jean-Daniel Cryans
>>>>>> <[email protected]> wrote:
>>>>>>> I'd say it's the:
>>>>>>>
>>>>>>> 2010-09-27 12:16:15,291 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 943
>>>>>>> file(s) in att of
>>>>>>> img833,dsc03711s.jpg,1285493435306.da57612ee69d7baaefe84eeb0e49f240.  into
>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img833/da57612ee69d7baaefe84eeb0e49f240/.tmp,
>>>>>>> sequenceid=618626242
>>>>>>>
>>>>>>> That killed you.  I wonder how it was able to get there, since the
>>>>>>> memstore blocks flushing when the blocking store files threshold is
>>>>>>> reached (the default is 7; did you set it to 1000 by any chance?).
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Mon, Sep 27, 2010 at 12:29 PM, Jack Levin <[email protected]> wrote:
>>>>>>>> Strange situation: we cold-started the cluster, and one of the
>>>>>>>> servers just started consuming more and more RAM, as you can see
>>>>>>>> from the screenshot I am attaching.  Here is the log:
>>>>>>>> http://pastebin.com/MDPJzLQJ
>>>>>>>>
>>>>>>>> There seems to be nothing happening, and then it just runs out of
>>>>>>>> memory and, of course, shuts down.
>>>>>>>>
>>>>>>>> Here is the GC log before the crash:  http://pastebin.com/GwdC3nhx
>>>>>>>>
>>>>>>>> Strange that the other region servers stay up and consume little
>>>>>>>> memory (or rather, stay stable).
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
