I see it: http://pastebin.com/tgQHBSLj
Interesting situation indeed. Any thoughts on how to avoid it? Have
compaction run more aggressively?

-Jack

On Mon, Sep 27, 2010 at 1:00 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Can you grep around the region server log files to see what was going
> on with that region during the previous run? There's only one way I
> can see this happening, and it would require that your region server
> was serving thousands of regions, that this region was queued to be
> compacted behind all of those thousands of regions, and that in the
> meantime the 90-second flush blocker timed out enough times to leave
> you with all those store files (which, by my quick calculation, would
> mean it took about 23 hours before the region server was able to
> compact that region; that's something I've never seen, and it would
> have killed your region server with an OOME). Do you see this message
> often?
>
> LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
>     "ms on a compaction to clean up 'too many store files'; waited " +
>     "long enough... proceeding with flush of " +
>     region.getRegionNameAsString());
>
> Thx,
>
> J-D
>
> On Mon, Sep 27, 2010 at 12:54 PM, Jack Levin <[email protected]> wrote:
>> Strange: this is what I have:
>>
>> <property>
>>   <name>hbase.hstore.blockingStoreFiles</name>
>>   <value>7</value>
>>   <description>
>>     If more than this number of StoreFiles in any one Store
>>     (one StoreFile is written per flush of MemStore) then updates are
>>     blocked for this HRegion until a compaction is completed, or
>>     until hbase.hstore.blockingWaitTime has been exceeded.
>>   </description>
>> </property>
>>
>> I wonder how it got there. I've deleted the files.
>>
>> -jack
>>
>>
>> On Mon, Sep 27, 2010 at 12:42 PM, Jean-Daniel Cryans
>> <[email protected]> wrote:
>>> I'd say it's the:
>>>
>>> 2010-09-27 12:16:15,291 INFO
>>> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 943
>>> file(s) in att of
>>> img833,dsc03711s.jpg,1285493435306.da57612ee69d7baaefe84eeb0e49f240.
>>> into
>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img833/da57612ee69d7baaefe84eeb0e49f240/.tmp,
>>> sequenceid=618626242
>>>
>>> That killed you. I wonder how it was able to get there, since the
>>> Memstore blocks flushing if the upper threshold for compactions is
>>> reached (the default is 7; did you set it to 1000 by any chance?).
>>>
>>> J-D
>>>
>>> On Mon, Sep 27, 2010 at 12:29 PM, Jack Levin <[email protected]> wrote:
>>>> Strange situation: I cold-started the cluster, and one of the
>>>> servers just started consuming more and more RAM, as you can see
>>>> from the screenshot I am attaching. Here is the log:
>>>> http://pastebin.com/MDPJzLQJ
>>>>
>>>> There seems to be nothing happening, and then it just runs out of
>>>> memory and, of course, shuts down.
>>>>
>>>> Here is the GC log before the crash: http://pastebin.com/GwdC3nhx
>>>>
>>>> Strangely, the other region servers stay up and consume little
>>>> memory (or at least stay stable).
>>>>
>>>> Any ideas?
>>>>
>>>> -Jack
>>>>
>>>
>>
>
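On the question of running compaction more aggressively, the knobs involved
are the store-file thresholds in hbase-site.xml. The sketch below is only
illustrative: the values 2 and 12 are assumptions for the example, not
settings anyone in the thread recommended (only the defaults of 7 store
files and 90 seconds come from the discussion).

<!-- Illustrative hbase-site.xml sketch; values are assumptions, not
     recommendations from the thread. -->

<!-- Start minor compactions after fewer flushes (default is 3). -->
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>2</value>
</property>

<!-- Number of StoreFiles in a Store at which updates are blocked until
     a compaction finishes (default is 7, as quoted above). -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>12</value>
</property>

<!-- How long a blocked flush waits before proceeding anyway (default is
     90000 ms, the 90-second flush blocker J-D mentions). -->
<property>
  <name>hbase.hstore.blockingWaitTime</name>
  <value>90000</value>
</property>

Lowering the compaction threshold trades extra compaction I/O for a shorter
store-file backlog, while raising blockingStoreFiles gives flushes more
headroom before writes stall on a single hot region.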
