Thanks for the detailed response. I will take steps to ensure I have enough memory and that things aren't getting swapped out. I may move Zookeeper to its own micro instance to make sure it isn't impacted by all the other Accumulo and Hadoop processes.
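For watching the "gc" lines mentioned in the quoted message below, here is a minimal Python sketch that pulls the per-collector pause deltas out of one tablet server debug log line and flags any that get close to the zookeeper timeout. The function names, the regular expression, and the `fraction` warning threshold are my own assumptions for illustration; only the log format and the 30 second default come from the quoted message.

```python
import re

# Matches entries such as "ParNew=0.33(+0.01) secs" and captures
# the collector name, the running total, and the delta in seconds.
# The pattern is an assumption based on the single sample line in this thread.
GC_DELTA = re.compile(r"(\w+)=([\d.]+)\(\+([\d.]+)\)\s*secs")

# Default zookeeper timeout as stated in the quoted message.
ZK_TIMEOUT_SECS = 30.0


def gc_deltas(line):
    """Return {collector_name: delta_seconds} for one 'gc' debug log line."""
    return {name: float(delta) for name, _total, delta in GC_DELTA.findall(line)}


def is_risky(line, timeout=ZK_TIMEOUT_SECS, fraction=0.5):
    """True if any GC delta reaches `fraction` of the timeout.

    `fraction` is a hypothetical early-warning margin, not an Accumulo setting.
    """
    return any(d >= timeout * fraction for d in gc_deltas(line).values())


sample = (
    "22 16:28:54,199 [tabletserver.TabletServer] DEBUG: gc ParNew=0.33(+0.01) "
    "secs ConcurrentMarkSweep=0.01(+0.01) secs freemem=108,455,440(+43,075,312) "
    "totalmem=132,055,040"
)
print(gc_deltas(sample))
print(is_risky(sample))
```

The freemem and totalmem fields are deliberately ignored here, since they are byte counts rather than pause times.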
On Fri, Mar 30, 2012 at 8:31 AM, Eric Newton <[email protected]> wrote:
> If a client fails to communicate with zookeeper for long enough, it loses
> its lock, and loses its exclusive access to the tablets it was serving.
> When that happens, it kills itself.
>
> Here are the reasons for a failure to communicate in a timely fashion:
> - tablet server has swapped out
> - tablet server needs to do a stop-the-world garbage collection
> - zookeeper swaps out
>
> The linux kernel aggressively swaps out processes in order to expand the
> disk cache. The degree that it will tend to do this is controlled with the
> swappiness kernel setting. Set this to zero:
>
> # echo 0 >/proc/sys/vm/swappiness
>
> Ensure that you have ample memory.
>
> You can see the status of available memory by looking for the "gc" lines in
> the tablet server debug log:
>
> 22 16:28:54,199 [tabletserver.TabletServer] DEBUG: gc ParNew=0.33(+0.01)
> secs ConcurrentMarkSweep=0.01(+0.01) secs freemem=108,455,440(+43,075,312)
> totalmem=132,055,040
>
> In particular, watch for the delta "(+0.01)" numbers. If this exceeds the
> zookeeper timeout (30 seconds by default), then you will most likely lose
> the server. You will notice this happening when the freemem approaches
> totalmem.
>
> I don't have much experience running Accumulo on VMs, but I have seen VMs
> have strange behavior with respect to timekeeping. That might be another
> possible culprit.
>
> -Eric
>
> On Fri, Mar 30, 2012 at 9:00 AM, Jared winick <[email protected]> wrote:
>>
>> I am running 1.4.0 RC6 in a single server configuration on EC2. After
>> over 1 day of successful MapReduce ingest, i see this as the first of
>> many errors/warnings in the monitor's recent logs.
>>
>> "Problem getting real goal state:
>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>> KeeperErrorCode = ConnectionLoss for
>> /accumulo/d36894a2-e760-4273-9c9f-dfa64ed8f4bc/masters/goal_state"
>>
>> This message is followed by attempts to reconnect to Zookeeper, and then
>> finally
>>
>> "Lost tablet server lock (reason = SESSION_EXPIRED), exiting."
>>
>> Zookeeper still appears to be running at this time. Obviously running
>> everything on a single VM is certainly not the ideal configuration.
>> Does anyone know what the root cause of my problem is and how I can
>> best avoid it happening again? Also, should i just stop and restart
>> Accumulo and everything should be OK again if Zookeeper is now
>> available and responsive?
>>
>> Thanks a lot.
>>
>> Jared Winick
