Everything is working fine now.

My best guess is that when we upgraded from 0.20.6 to 0.89.20100726 there
was a change in hostname resolution (either by hbase, hdfs, or us).  In
0.20.6, our regionservers looked each other up via IP address, but after the
upgrade it switched to hostname, and some of our servers were not aware of
each other's hostnames.  Then the CompactSplitThread did the compaction part
but failed to split because it got an unknown host exception.  Is that
plausible?
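
One quick way to sanity-check that theory from each node is to confirm it can resolve its peers by name. This is just a sketch; the peer list below uses localhost as a placeholder for real regionserver hostnames:

```shell
# Return success iff this node can resolve the given hostname.
can_resolve() {
  getent hosts "$1" > /dev/null 2>&1
}

# Substitute your actual regionserver hostnames for localhost here.
for peer in localhost; do
  if can_resolve "$peer"; then
    echo "ok: $peer"
  else
    echo "MISSING: $peer (would match an UnknownHostException during split)"
  fi
done
```

Running that on every regionserver (and the master) should print "ok" for each resolvable peer and flag any hostname that would trigger the unknown host exception.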

Is there a way to configure it so that regionservers are referenced by IP
addresses instead of hostnames?  When we add a regionserver to a running
cluster, it's pretty easy to automatically add its name to the master's
hosts file, but it's less reliable to try to add it to all other
regionservers and client machines.  Maybe just not even populate the
master's hosts file?

I guess the downside there is that we'd lose readability in the logs, etc.

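For what it's worth, the hosts-file automation could be as simple as generating one fragment from a single list of regionservers and pushing it to every node (master, regionservers, and clients alike), rather than maintaining each machine's file separately. A minimal sketch, with made-up IPs and hostnames:

```shell
# Assumed input: one "IP hostname" pair per regionserver (values are examples).
cat > regionservers.txt <<'EOF'
10.1.0.11 rs1.example.internal
10.1.0.12 rs2.example.internal
EOF

# Build a hosts-file fragment; in practice this would be appended to
# /etc/hosts on every node in the cluster, not just the master.
awk '{ printf "%-15s %s\n", $1, $2 }' regionservers.txt > hosts.fragment
cat hosts.fragment
```

Distributing the same fragment everywhere would keep hostname resolution consistent, which also keeps the hostnames in the logs.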

On Tue, Sep 28, 2010 at 3:15 PM, Matt Corgan <[email protected]> wrote:

> I'll try to reproduce it and capture some comprehensive log files, but
> we're testing on EC2 and had terminated some of the servers before noticing
> what was happening.
>
> I think it's been doing successful compactions all along because there are
> only 3 files in that directory.  Here's the hdfs files for that particular
> table (line 109): http://pastebin.com/8fsDmh6M
>
> If i stopped inserting to the cluster altogether to give it time to
> breathe, is the intended behaviour that the region should be split after
> compaction because its size is greater than 256 MB?  I'll try again to
> reproduce, but I'm fairly certain it's just sitting there based on
> network/disk/cpu activity.
>
>
> On Tue, Sep 28, 2010 at 12:01 PM, Stack <[email protected]> wrote:
>
>> On Mon, Sep 27, 2010 at 4:26 PM, Matt Corgan <[email protected]> wrote:
>> > I'm sequentially importing ~1 billion small rows (32 byte keys) into a
>> table
>> > called StatAreaModelLink.  I realize that sequential insertion isn't
>> > efficient by design, but I'm not in a hurry so I let it run all weekend.
>> >  It's been proceeding quickly except for ~20s stalls every minute or so.
>> >
>> > I also noticed that one regionserver was getting all the load and just
>> > figured that after each split the later region stayed on the current
>> node.
>> >  Turns out the last region stopped splitting altogether and now has a
>> 33gb
>> > store file.
>> >
>>
>> Interesting.
>>
>>
>> > I started importing on 0.20.6, but switched to 0.89.20100726 today.
>>  They
>> > both seem to act similarly.  Using all default settings except
>> VERSIONS=1.
>> >
>> > That regionserver's logs constantly say "Compaction requested for
>> region...
>> > because regionserver60020.cacheFlusher"
>> >
>> > http://pastebin.com/WJDs7ZbM
>> >
>> > Am I doing something wrong, like not giving it enough time to
>> flush/compact?
>> >  There are 23 previous regions that look ok.
>> >
>>
>> I wonder if a compaction is running and it's just taking a long time.
>> Grep for 'Starting compaction' in your logs.  See when last started?
>>
>> I see you continue to flush.  Try taking the load off.
>>
>> You might also do a:
>>
>> > bin/hadoop fs -lsr /hbase
>>
>> ... and pastebin it.  I'd be looking for a region with a bunch of files in
>> it.
>>
>> Finally, you've read about the bulk load [1] tool?
>>
>> St.Ack
>>
>> 1. http://hbase.apache.org/docs/r0.89.20100726/bulk-loads.html
>>
>
>
