If a client fails to communicate with zookeeper for long enough, it loses its lock and, with it, exclusive access to the tablets it was serving. When that happens, it kills itself.
Here are the usual reasons for a failure to communicate in a timely fashion:

 - tablet server has swapped out
 - tablet server needs to do a stop-the-world garbage collection
 - zookeeper swaps out

The Linux kernel aggressively swaps out processes in order to expand the disk cache. How strongly it tends to do this is controlled by the swappiness kernel setting. Set this to zero:

  # echo 0 >/proc/sys/vm/swappiness

Ensure that you have ample memory. You can see the status of available memory by looking for the "gc" lines in the tablet server debug log:

  22 16:28:54,199 [tabletserver.TabletServer] DEBUG: gc ParNew=0.33(+0.01) secs ConcurrentMarkSweep=0.01(+0.01) secs freemem=108,455,440(+43,075,312) totalmem=132,055,040

In particular, watch the delta "(+0.01)" numbers. If a delta exceeds the zookeeper timeout (30 seconds by default), then you will most likely lose the server. You will notice this happening when the freemem approaches totalmem.

I don't have much experience running Accumulo on VMs, but I have seen VMs behave strangely with respect to timekeeping. That might be another possible culprit.

-Eric

On Fri, Mar 30, 2012 at 9:00 AM, Jared winick <[email protected]> wrote:
> I am running 1.4.0 RC6 in a single server configuration on EC2. After
> over 1 day of successful MapReduce ingest, i see this as the first of
> many errors/warnings in the monitor's recent logs.
>
> "Problem getting real goal state:
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for
> /accumulo/d36894a2-e760-4273-9c9f-dfa64ed8f4bc/masters/goal_state"
>
> This message is followed by attempts to reconnect to Zookeeper, and then
> finally
>
> "Lost tablet server lock (reason = SESSION_EXPIRED), exiting."
>
> Zookeeper still appears to be running at this time. Obviously running
> everything on a single VM is certainly not the ideal configuration.
> Does anyone know what the root cause of my problem is and how I can
> best avoid it happening again? Also, should i just stop and restart
> Accumulo and everything should be OK again if Zookeeper is now
> available and responsive?
>
> Thanks a lot.
>
> Jared Winick
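
P.S. If it helps, here is a rough sketch of how the "gc" lines in the tablet server debug log could be scanned for long collection deltas. It is only an illustration, not part of Accumulo: the function names (parse_gc_line, exceeds_timeout) and the regular expressions are my own assumptions, based solely on the single sample log line in this message, and the 30 second figure is the default zookeeper timeout mentioned above.

```python
import re

# Assumed layout, taken from the one sample line in this thread:
#   <Collector>=<total secs>(+<delta secs>) secs ... freemem=<n>(...) totalmem=<n>
GC_RE = re.compile(r"(\w+)=(\d+(?:\.\d+)?)\(\+(\d+(?:\.\d+)?)\) secs")
MEM_RE = re.compile(r"(freemem|totalmem)=([\d,]+)")

ZK_TIMEOUT_SECS = 30.0  # default zookeeper timeout quoted in this message


def parse_gc_line(line):
    """Return GC deltas and memory figures from a 'gc' debug line, or None."""
    if " gc " not in line:
        return None
    # Keep only the "(+delta)" value for each collector.
    deltas = {name: float(delta) for name, _total, delta in GC_RE.findall(line)}
    mem = {key: int(value.replace(",", "")) for key, value in MEM_RE.findall(line)}
    return {
        "deltas": deltas,
        "freemem": mem.get("freemem"),
        "totalmem": mem.get("totalmem"),
    }


def exceeds_timeout(parsed, timeout=ZK_TIMEOUT_SECS):
    """True if any collector's delta is at least the zookeeper timeout."""
    return any(delta >= timeout for delta in parsed["deltas"].values())


if __name__ == "__main__":
    sample = (
        "22 16:28:54,199 [tabletserver.TabletServer] DEBUG: gc "
        "ParNew=0.33(+0.01) secs ConcurrentMarkSweep=0.01(+0.01) secs "
        "freemem=108,455,440(+43,075,312) totalmem=132,055,040"
    )
    parsed = parse_gc_line(sample)
    print(parsed)
    print("at risk of losing the lock:", exceeds_timeout(parsed))
```

A delta is the collection time accumulated since the previous gc line, so treat a hit as a hint to look closer rather than proof of a single long pause.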
