Hi Stack,

Thanks for the reply. Unfortunately, our first instinct was to restart the region servers, and when they came back up the compaction appears to have succeeded (perhaps because on a fresh restart the heap was low enough for it to complete). I listed the files under that region and there is now only one file.
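In case it's useful, this is roughly how I listed them: a minimal sketch against the plain HDFS API, where the region/family path is a placeholder (the real one came from the region name), so don't take it as verbatim what I ran.

  // Minimal sketch, not verbatim: list the store files under one column
  // family of the region. The path below is a placeholder.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListRegionFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml from the classpath
      FileSystem fs = FileSystem.get(conf);
      // usual on-disk layout: /hbase/<table>/<encoded-region-name>/<family>
      Path familyDir = new Path("/hbase/mytable/1234abcd5678ef/f1");
      for (FileStatus file : fs.listStatus(familyDir)) {
        System.out.println(file.getPath().getName() + "\t" + file.getLen() + " bytes");
      }
    }
  }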
We are going to be running this job again in the near future. We will try to rate limit the writes a bit (though only 10 reducers were running at once to begin with), and I will keep your suggestions in mind if it happens again despite that.

- Bryan

On Wed, Apr 11, 2012 at 4:35 PM, Stack <[email protected]> wrote:
> On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
> <[email protected]> wrote:
> > We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> > hosting about 17k regions.
>
> Thats too many but thats another story.
>
> > That pattern repeats on all of the region servers, every 5-8 minutes until
> > all are down. Should there be some safeguards on a compaction causing a
> > region server to go OOM? The region appears to only be around 425mb in
> > size.
>
> My guess is that Region A has a massive or corrupt record in it.
>
> You could disable the region for now while you are figuring whats wrong
> w/it.
>
> If you list files under this region, what do you see? Are there many?
>
> Can you see what files are selected for compaction? This will narrow
> the set to look at. You could poke at them w/ the hfile tool. See
> '8.7.5.2.2. HFile Tool' in the reference guide.
>
> St.Ack
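P.S. For anyone hitting this thread later: the HFile tool mentioned above is just a main class, so it can also be driven from Java. A rough sketch follows; the -f and -m flags are as I read them in that reference guide section, and the store file path is a placeholder, so treat both as assumptions rather than a tested recipe.

  // Rough sketch: dump an HFile's metadata (trailer/file info) for one store
  // file, which is a first step toward spotting an oversized or corrupt record.
  // Flags and path are assumptions; see the reference guide section cited above.
  public class InspectHFile {
    public static void main(String[] args) throws Exception {
      org.apache.hadoop.hbase.io.hfile.HFile.main(new String[] {
          "-m",                                                      // print file metadata
          "-f", "hdfs:///hbase/mytable/1234abcd5678ef/f1/1234567890" // placeholder store file path
      });
    }
  }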
