Hi Stack,

Thanks for the reply. Unfortunately, our first instinct was to restart the region servers, and when they came back up the compaction appears to have succeeded (perhaps because on a fresh restart the heap was low enough for it to complete). I listed the files under that region and there is now only one file.
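In case it's useful, this is roughly how I listed them: a minimal sketch against the plain HDFS API, where the region/family path is a placeholder (the real one came from the region name), so don't take it as verbatim what I ran.

  // Minimal sketch, not verbatim: list the store files under one column
  // family of the region. The path below is a placeholder.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListRegionFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml from the classpath
      FileSystem fs = FileSystem.get(conf);
      // usual on-disk layout: /hbase/<table>/<encoded-region-name>/<family>
      Path familyDir = new Path("/hbase/mytable/1234abcd5678ef/f1");
      for (FileStatus file : fs.listStatus(familyDir)) {
        System.out.println(file.getPath().getName() + "\t" + file.getLen() + " bytes");
      }
    }
  }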
We are going to be running this job again in the near future. We will try to rate limit the writes a bit (though only 10 reducers were running at once to begin with), and I will keep your suggestions in mind if it happens again despite that.

- Bryan

On Wed, Apr 11, 2012 at 4:35 PM, Stack <[email protected]> wrote:
> On Wed, Apr 11, 2012 at 10:24 AM, Bryan Beaudreault
> <[email protected]> wrote:
> > We have 16 m1.xlarge ec2 machines as region servers, running cdh3u2,
> > hosting about 17k regions.
>
> Thats too many but thats another story.
>
> > That pattern repeats on all of the region servers, every 5-8 minutes until
> > all are down. Should there be some safeguards on a compaction causing a
> > region server to go OOM? The region appears to only be around 425mb in
> > size.
>
> My guess is that Region A has a massive or corrupt record in it.
>
> You could disable the region for now while you are figuring whats wrong
> w/it.
>
> If you list files under this region, what do you see? Are there many?
>
> Can you see what files are selected for compaction? This will narrow
> the set to look at. You could poke at them w/ the hfile tool. See
> '8.7.5.2.2. HFile Tool' in the reference guide.
>
> St.Ack
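P.S. For anyone hitting this thread later: the HFile tool mentioned above is just a main class, so it can also be driven from Java. A rough sketch follows; the -f and -m flags are as I read them in that reference guide section, and the store file path is a placeholder, so treat both as assumptions rather than a tested recipe.

  // Rough sketch: dump an HFile's metadata (trailer/file info) for one store
  // file, which is a first step toward spotting an oversized or corrupt record.
  // Flags and path are assumptions; see the reference guide section cited above.
  public class InspectHFile {
    public static void main(String[] args) throws Exception {
      org.apache.hadoop.hbase.io.hfile.HFile.main(new String[] {
          "-m",                                                      // print file metadata
          "-f", "hdfs:///hbase/mytable/1234abcd5678ef/f1/1234567890" // placeholder store file path
      });
    }
  }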
