Yeah, oversubscribed memory is my guess. It's the most common problem I see, especially when the failure happens during heavy load like MR jobs.
On Wed, Jun 25, 2014 at 1:29 PM, John Vines <[email protected]> wrote:

> Why are your tservers dying? You say the log only shows startup with no
> errors, but what about from before you restarted? Keep in mind that the
> .out and .err files get clobbered on restart, so you need to check them
> before you restart the servers.
>
> I have a hunch that you're either experiencing OOM errors, which is an
> indication of poor Accumulo configuration, or you're losing ZK locks,
> which is an indicator of various things, from poor network to poor system
> configuration.
>
>
> On Wed, Jun 25, 2014 at 2:10 PM, Sean Busbey <[email protected]> wrote:
>
>> What version of Accumulo?
>>
>> What version of Hadoop?
>>
>> What does your server memory and per-role allocation look like?
>>
>> Can you paste the tserver debug log?
>>
>>
>> On Wed, Jun 25, 2014 at 1:01 PM, Jacob Rust <[email protected]> wrote:
>>
>>> I am trying to create an inverted text index for a table using the
>>> Accumulo input/output formats in a Java MapReduce program. When the job
>>> reaches the reduce phase and creates the table / tries to write to it,
>>> the tablet servers begin to die.
>>>
>>> Now when I do a start-all.sh, the tablet servers start for about a
>>> minute and then die again. Any idea why the MapReduce job is killing
>>> the tablet servers, and/or how to bring the tablet servers back up
>>> without them failing?
>>>
>>> This is on a 12-node cluster with low-quality hardware.
>>>
>>> The Java code I am running is here: http://pastebin.com/ti7Qz19m
>>>
>>> The log files on each tablet server only show the startup information,
>>> no errors. The log files on the master server show these errors:
>>> http://pastebin.com/LymiTfB7
>>>
>>> --
>>> Jacob Rust
>>> Software Intern
>>
>> --
>> Sean

--
Sean
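For reference, the memory oversubscription diagnosed above is typically tuned via the tserver memory properties in accumulo-site.xml (the tserver heap itself is set through ACCUMULO_TSERVER_OPTS in accumulo-env.sh). A minimal sketch for a low-memory node follows; the property names are from Accumulo 1.x, but the values are illustrative assumptions, not recommendations:

```xml
<!-- accumulo-site.xml: illustrative low-memory settings (values are
     assumptions for a small node; adjust to your hardware). The sum of
     these, plus the tserver JVM heap, must fit alongside the DataNode
     and any MapReduce tasks running on the same host. -->
<property>
  <name>tserver.memory.maps.max</name>
  <value>256M</value>
</property>
<property>
  <name>tserver.cache.data.size</name>
  <value>64M</value>
</property>
<property>
  <name>tserver.cache.index.size</name>
  <value>64M</value>
</property>
```

Note that unless native in-memory maps are enabled, the in-memory map is allocated from the tserver's Java heap, so tserver.memory.maps.max must fit comfortably inside the heap size set in accumulo-env.sh.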
