Thanks for your help again, Stack... sorry, I don't have the logs. Will do a better job of saving them. By the way, this time the insert job maintained about 22k rows/sec all night without any pauses, and even though it was sequential insertion, it did a nice job of rotating the active region around the cluster.
As for the hostnames, there are no problems in 0.89, and nothing is onerous by any means... we are just trying to come to some level of familiarity before putting any real data into hbase. EC2/RightScale make it very easy to add and remove regionservers with the click of a button, which is the reason that the hosts file can change more often than you'd want to modify it manually. We're going to go the route of having each newly added regionserver append its name to the hosts file of every other server in our EC2 account (~30 servers). The only downsides I see there are that it doesn't scale very elegantly, and that it gets complicated if you want to launch multiple regionservers or new clients at the same time.

For the sake of brainstorming, maybe it's possible to have the master always broadcast IP addresses and have all communication done via IP. This may be more robust anyway. Then the first time a new regionserver or client gets an unfamiliar IP address, it can try to figure out the hostname (the same way the master currently does this) and cache it somewhere. The hostname could be added alongside the IP address or replace it in the logs for convenience. I've put rough sketches of both ideas at the bottom of this mail.

Thanks again,
Matt

On Wed, Sep 29, 2010 at 12:53 PM, Stack <[email protected]> wrote:

> On Wed, Sep 29, 2010 at 9:22 AM, Matt Corgan <[email protected]> wrote:
> > Everything is working fine now.
> >
> > My best guess is that when we upgraded from 0.20.6 to 0.89.20100726
> > there was a change in hostname resolution (either by hbase, hdfs, or us).
>
> Resolution is done differently in 0.89.
>
> RS checks into master. Master tells it what it sees as its hostname
> and ever after the RS will use what the master told it when it's
> talking to the master. Only the master's DNS setup needs to make some
> bit of sense.
>
> > In 0.20.6, our regionservers looked each other up via IP address, but
> > after the upgrade it switched to hostname, and some of our servers were
> > not aware of each other's hostnames. Then the CompactSplitThread did
> > the compaction part but failed to split because it got an unknown host
> > exception. Is that plausible?
>
> You have log from that time?
>
> > Is there a way to configure it so that regionservers are referenced by
> > IP addresses instead of hostnames? When we add a regionserver to a
> > running cluster, it's pretty easy to automatically add its name to the
> > master's hosts file, but it's less reliable to try to add it to all
> > other regionservers and client machines. Maybe just not even populate
> > the master's hosts file?
> >
> > I guess the downside there is that we'd lose readability in the logs,
> > etc.
>
> Well, is there a problem w/ how 0.89 works?
>
> I suppose clients need to be in agreement w/ the master as regards
> hostnames. Is that too onerous an expectation?
>
> If master can't resolve hostnames it'll just use IPs. I suppose you
> could use this fact to force your cluster all IP, and I suppose we
> could include a flag to go all IPs all over, but I'd be interested in
> how 0.89 naming is failing you so I can try to fix it.
>
> Thanks,
> St.Ack
>
> > On Tue, Sep 28, 2010 at 3:15 PM, Matt Corgan <[email protected]> wrote:
> >
> >> I'll try to reproduce it and capture some comprehensive log files, but
> >> we're testing on EC2 and had terminated some of the servers before
> >> noticing what was happening.
> >>
> >> I think it's been doing successful compactions all along because there
> >> are only 3 files in that directory.
> >> Here's the hdfs files for that particular table (line 109):
> >> http://pastebin.com/8fsDmh6M
> >>
> >> If I stopped inserting to the cluster altogether to give it time to
> >> breathe, is the intended behaviour that the region should be split
> >> after compaction because its size is greater than 256 MB? I'll try
> >> again to reproduce, but I'm fairly certain it's just sitting there
> >> based on network/disk/cpu activity.
> >>
> >> On Tue, Sep 28, 2010 at 12:01 PM, Stack <[email protected]> wrote:
> >>
> >>> On Mon, Sep 27, 2010 at 4:26 PM, Matt Corgan <[email protected]> wrote:
> >>> > I'm sequentially importing ~1 billion small rows (32 byte keys)
> >>> > into a table called StatAreaModelLink. I realize that sequential
> >>> > insertion isn't efficient by design, but I'm not in a hurry so I
> >>> > let it run all weekend. It's been proceeding quickly except for
> >>> > ~20s stalls every minute or so.
> >>> >
> >>> > I also noticed that one regionserver was getting all the load and
> >>> > just figured that after each split the later region stayed on the
> >>> > current node. Turns out the last region stopped splitting
> >>> > altogether and now has a 33gb store file.
> >>>
> >>> Interesting.
> >>>
> >>> > I started importing on 0.20.6, but switched to 0.89.20100726 today.
> >>> > They both seem to act similarly. Using all default settings except
> >>> > VERSIONS=1.
> >>> >
> >>> > That regionserver's logs constantly say "Compaction requested for
> >>> > region... because regionserver60020.cacheFlusher"
> >>> >
> >>> > http://pastebin.com/WJDs7ZbM
> >>> >
> >>> > Am I doing something wrong, like not giving it enough time to
> >>> > flush/compact? There are 23 previous regions that look ok.
> >>>
> >>> I wonder if a compaction is running and it's just taking a long time.
> >>> Grep for 'Starting compaction' in your logs. See when it last started?
> >>>
> >>> I see you continue to flush. Try taking the load off.
> >>>
> >>> You might also do a:
> >>>
> >>>   bin/hadoop fs -lsr /hbase
> >>>
> >>> ... and pastebin it. I'd be looking for a region with a bunch of
> >>> files in it.
> >>>
> >>> Finally, you've read about the bulk load [1] tool?
> >>>
> >>> St.Ack
> >>>
> >>> 1. http://hbase.apache.org/docs/r0.89.20100726/bulk-loads.html
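P.S. In case it helps the brainstorming, here's roughly what I mean by the hosts-file approach, as a Python sketch. It's not what we actually run; the server list, ssh user, and helper names are made up for illustration (in reality the list would come from the EC2/RightScale API), and it still has the concurrent-launch race mentioned above. A newly launched regionserver would run it once at boot, before starting the regionserver process.

#!/usr/bin/env python
# Sketch only: push this regionserver's "<ip> <hostname>" line into the
# /etc/hosts of every other server in the account.  Server list, ssh user,
# and passwordless-ssh setup are placeholder assumptions.
import socket
import subprocess

# Hypothetical list of the ~30 servers; would really come from the EC2 API.
CLUSTER_SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def my_hosts_entry():
    """Build the '<ip> <hostname>' line for this new regionserver."""
    hostname = socket.getfqdn()
    ip = socket.gethostbyname(hostname)
    return "%s %s" % (ip, hostname)

def append_to_remote_hosts(server, entry):
    """Append the entry to /etc/hosts on one remote server via ssh,
    skipping it if it's already there.  No locking, so two servers
    launching at once can still race (the downside mentioned above)."""
    cmd = "grep -qF '%s' /etc/hosts || echo '%s' >> /etc/hosts" % (entry, entry)
    subprocess.check_call(["ssh", "root@" + server, cmd])

if __name__ == "__main__":
    entry = my_hosts_entry()
    for server in CLUSTER_SERVERS:
        append_to_remote_hosts(server, entry)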

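And a sketch of the lookup-and-cache behaviour for the IP-broadcast idea. This isn't HBase code, just an illustration of resolving an unfamiliar IP once and keeping the hostname around for the logs while all actual traffic stays IP-only.

# Sketch only: "talk by IP, resolve hostnames lazily for log readability".
import socket

_hostname_cache = {}  # ip -> hostname (or the ip itself if lookup failed)

def display_name(ip):
    """Return 'hostname (ip)' for log messages.  Each unfamiliar IP is
    resolved once and cached, so normal traffic never depends on DNS."""
    if ip not in _hostname_cache:
        try:
            _hostname_cache[ip] = socket.gethostbyaddr(ip)[0]
        except (socket.herror, socket.gaierror):
            _hostname_cache[ip] = ip  # reverse lookup failed; log the bare IP
    name = _hostname_cache[ip]
    if name == ip:
        return ip
    return "%s (%s)" % (name, ip)

# e.g. print("regionserver %s checked in" % display_name("10.0.0.12"))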