Ditto. I don't like the idea of sprinkling additional stuff into log4j, but I am in favor of trying to make it easier to recognize when tservers die due to OOMs, if anyone has more suggestions.
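For what it's worth, here's a rough sketch of how Adam's hook could append a marker without involving log4j at all: keep the kill, but have the command drop one line into a plain file next to the .err/.out files. This is only an illustration, not the actual bin/accumulo code; ACCUMULO_LOG_DIR, oom-kill.log, and $JAVA are placeholder names here, and I'm assuming the JVM hands the OnOutOfMemoryError string to a shell after substituting %p with its own pid:

  # Sketch only -- not the real bin/accumulo. ACCUMULO_LOG_DIR, oom-kill.log,
  # and $JAVA are placeholder names for illustration.
  ACCUMULO_LOG_DIR="${ACCUMULO_LOG_DIR:-$ACCUMULO_HOME/logs}"

  # %p is replaced by the JVM with its own pid; the echo lands in a plain file
  # alongside the .err/.out files rather than going through log4j.
  OOM_HOOK="echo \"tserver pid %p killed: OutOfMemoryError at \$(date)\" >> ${ACCUMULO_LOG_DIR}/oom-kill.log; kill -9 %p"

  exec "$JAVA" "-XX:OnOutOfMemoryError=${OOM_HOOK}" "$@"

If the consensus is that the .err file is already enough, ignore the above. I've also put a couple of quick process/log sanity checks below the quoted thread for the accumulo-env.sh issue Mike ran into.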
On Wednesday, February 27, 2013, Christopher wrote:
> I agree with John Vines.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Wed, Feb 27, 2013 at 12:32 PM, John Vines <[email protected]> wrote:
> > I don't like the idea of blending manual logging with log4j in a single
> > file. It's in the .err file already, I don't think anything else is
> > necessary.
> >
> > On Wed, Feb 27, 2013 at 3:27 PM, Adam Fuchs <[email protected]> wrote:
> >>
> >> So, question for the community: inside bin/accumulo we have:
> >> -XX:OnOutOfMemoryError="kill -9 %p"
> >> Should this also append a log message? Something like:
> >> -XX:OnOutOfMemoryError="kill -9 %p; echo "ran out of memory >> logfilename"
> >> Is this necessary, or should the OutOfMemoryError still find its way
> >> to the regular log?
> >>
> >> Adam
> >>
> >> On Wed, Feb 27, 2013 at 3:17 PM, Mike Hugo <[email protected]> wrote:
> >>>
> >>> I'm chalking this up to a misconfigured server. It looks like during
> >>> the install on this server the accumulo-env.sh file was copied from the
> >>> examples, but rather than editing it to set JAVA_HOME, HADOOP_HOME, and
> >>> ZOOKEEPER_HOME, the entire file contents were replaced with those env
> >>> variables.
> >>>
> >>> I'm assuming this caused us to pick up the default (?) _OPTS settings
> >>> rather than the correct ones we should have been getting from the
> >>> examples based on our server's memory capacity. So we had a bunch of
> >>> Accumulo-related Java processes all running with memory settings that
> >>> were way out of whack from what they should have been.
> >>>
> >>> To solve it I copied in the files from the conf/examples directory again,
> >>> made sure everything was set up correctly, and restarted everything.
> >>>
> >>> We never did see anything in our log files or .out / .err logs indicating
> >>> the source of the problem, but the above is my best guess as to what was
> >>> going on.
> >>>
> >>> Thanks again for all the tips and pointers!
> >>>
> >>> Mike
> >>>
> >>> On Wed, Feb 27, 2013 at 11:24 AM, Adam Fuchs <[email protected]> wrote:
> >>>>
> >>>> There are a few primary reasons why your tablet server would die:
> >>>> 1. Lost lock in Zookeeper. If the tablet server and zookeeper can't
> >>>> communicate with each other then the lock will time out and the tablet
> >>>> server will kill itself. This should show up as several messages in the
> >>>> tserver log. If this happens when a tablet server is really busy (lots
> >>>> of threads doing stuff) then the log message about the lost lock can be
> >>>> pretty far back in the queue. Java garbage collection can cause long
> >>>> pauses that inhibit the tserver/zookeeper messages. Zookeeper can also
> >>>> get overwhelmed and behave poorly if the server it's running on swaps
> >>>> it out.
> >>>> 2. Problems talking with the master. If a tablet server is too slow in
> >>>> communicating with the master then the master will try to kill it. This
> >>>> should show up in the master log, and will also be noted in the tserver
> >>>> log.
> >>>> 3. Out of memory. If the tserver JVM runs out of memory it will
> >>>> terminate. As John mentioned, this will be in the .err or .out files in
> >>>> the log directory.
> >>>>
> >>>> Adam
> >>>>
> >>>> On Wed, Feb 27, 2013 at 12:10 PM, Mike Hugo <[email protected]> wrote:
> >>>>>
> >>>>> After running an ingest process via map reduce for about an hour or so,
> >>>>> one of our tservers fails. It happens pretty consistently; we're able
> >>>>> to replicate it without too much difficulty.
> >>>>>
> >>>>> I'm looking in the $ACCUMULO_HOME/logs directory for clues as to why
> >>>>> the tserver fails, but I'm not seeing much that points to a cause
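For the misconfigured accumulo-env.sh case above, a couple of quick sanity checks. These are plain ps/grep one-liners, nothing Accumulo-specific, and the log file names and message patterns are just guesses to start from, not exact strings:

  # List the heap flags each tablet server JVM is actually running with; a
  # clobbered accumulo-env.sh shows up as -Xms/-Xmx values that don't match
  # the conf/examples file you meant to use.
  ps -ef | grep '[t]server' | grep -o -e '-Xm[sx][^ ]*'

  # Look for lost-lock / zookeeper messages in the tserver logs, and for OOMs
  # in the .err/.out files Adam and John mentioned.
  grep -iE 'lost.*lock|zookeeper' "$ACCUMULO_HOME"/logs/tserver*.log | tail -n 20
  grep -i 'OutOfMemory' "$ACCUMULO_HOME"/logs/*.err "$ACCUMULO_HOME"/logs/*.out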
