Can you grep around the region server log files to see what was going
on with that region during the previous run? There's only one way I can
see this happening: your region server would have to be serving
thousands of regions, this region would have to be queued for
compaction behind all those thousands of regions, and in the meantime
the 90-second flush blocker would have to time out enough times to
leave you with all those store files. By my quick calculation that
would mean it took about 23 hours before the region server was able to
compact that region, which is something I've never seen, and it would
have killed your region server with an OOME. Do you see this message
often?

       LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
          "ms on a compaction to clean up 'too many store files'; waited " +
          "long enough... proceeding with flush of " +
          region.getRegionNameAsString());
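
For reference, here is a minimal sketch of the decision behind that
message (my own paraphrase, not the actual MemStoreFlusher code; the
method and parameter names are made up, and the defaults are the 7
blocking store files and 90 seconds / 90000 ms mentioned elsewhere in
this thread):

       // Rough sketch, not the real HBase source: a flush request for a
       // region whose store holds more than hbase.hstore.blockingStoreFiles
       // files is held back until either a compaction brings the count down
       // or hbase.hstore.blockingWaitTime (90000 ms by default) has elapsed;
       // only then does the flush proceed and add yet another store file.
       boolean shouldProceedWithFlush(long requestCreateTime, int storeFileCount,
           int blockingStoreFiles, long blockingWaitTimeMs) {
         if (storeFileCount <= blockingStoreFiles) {
           return true;                      // under the threshold, flush now
         }
         long waited = System.currentTimeMillis() - requestCreateTime;
         return waited > blockingWaitTimeMs; // waited long enough, flush anyway
       }

If that check fired over and over for the same region you would see the
log line above each time, hence the question.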

Thx,

J-D

On Mon, Sep 27, 2010 at 12:54 PM, Jack Levin <[email protected]> wrote:
> Strange: this is what I have:
>
>  <property>
>    <name>hbase.hstore.blockingStoreFiles</name>
>    <value>7</value>
>    <description>
>    If more than this number of StoreFiles in any one Store
>    (one StoreFile is written per flush of MemStore) then updates are
>    blocked for this HRegion until a compaction is completed, or
>    until hbase.hstore.blockingWaitTime has been exceeded.
>    </description>
>  </property>
>
> I wonder how it got there; I've deleted the files.
>
> -jack
>
>
> On Mon, Sep 27, 2010 at 12:42 PM, Jean-Daniel Cryans
> <[email protected]> wrote:
>> I'd say it's the:
>>
>> 2010-09-27 12:16:15,291 INFO
>> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 943
>> file(s) in att of
>> img833,dsc03711s.jpg,1285493435306.da57612ee69d7baaefe84
>> eeb0e49f240.  into
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img833/da57612ee69d7baaefe84eeb0e49f240/.tmp,
>> sequenceid=618626242
>>
>> That killed you. I wonder how it was able to get there, since the
>> MemStore blocks flushing if the upper threshold for compactions is
>> reached (the default is 7; did you set it to 1000 by any chance?).
>>
>> J-D
>>
>> On Mon, Sep 27, 2010 at 12:29 PM, Jack Levin <[email protected]> wrote:
>>> Strange situation: I cold-started the cluster, and one of the servers
>>> just started consuming more and more RAM, as you can see from the
>>> screenshot I am attaching.  Here is the log:
>>> http://pastebin.com/MDPJzLQJ
>>>
>>> There seems to be nothing happening, and then it just runs out of
>>> memory and, of course, shuts down.
>>>
>>> Here is GC log before the crash:  http://pastebin.com/GwdC3nhx
>>>
>>> Strangely, the other region servers stay up and consume little
>>> memory (or rather, stay stable).
>>>
>>> Any ideas?
>>>
>>> -Jack
>>>
>>
>
