We would need to see a little bit more of the logs prior to the error. The tablet server is losing its connection to zookeeper.
I have seen problems like this when a tablet server has been pushed into swap. When the server is tasked to do work, it begins to use the swapped-out memory, and the process is paused while the pages are swapped back in. The pauses prevent the zookeeper client API from sending keep-alive messages to zookeeper, so zookeeper thinks the process has died, and the tablet server loses its lock. (A sketch of this session-expiry behavior follows the quoted message below.)

Have you changed your system's swappiness to zero, as outlined in the README? Check the debug lines containing "gc" and verify the server has plenty of free memory. (A small check for the swappiness setting is also sketched below.)

-Eric

On Mon, Jan 13, 2014 at 8:11 AM, Anthony F <[email protected]> wrote:

> I am experiencing an issue where I lose one or more tservers when bulk
> importing the results of a mapreduce job. After the job is finished and
> the bulk import is kicked off, I observe the following in the lost
> tserver's logs:
>
> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
> unexpected zookeeper event: None for
> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/recovery
> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
> unexpected zookeeper event: None for
> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
> 2014-01-10 23:14:21,369 [zookeeper.DistributedWorkQueue] ERROR: Failed to
> look for work
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for
> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>
> However, the bulk import actually succeeded and all is well with the data
> in the table. I have to restart the tserver each time this happens, which
> is not a viable solution for production.
>
> I am using Accumulo 1.5.0. Tservers have 12G of RAM, and index caching,
> CF bloom filters, and locality groups are turned on for the table in
> question. Any ideas why this might be happening?
>
> Thanks,
> Anthony
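To make the failure mode concrete, here is a minimal sketch of a ZooKeeper client watching its own session state. The connect string "localhost:2181" and the 30-second timeout are illustrative values, not Accumulo's actual configuration; the point is that a JVM pause longer than the negotiated session timeout ends with an Expired event, at which point the server has already deleted the session's ephemeral nodes, such as a tserver's lock.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionWatch {
        public static void main(String[] args) throws Exception {
            // "localhost:2181" and the 30s timeout are example values only.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    switch (event.getState()) {
                        case SyncConnected:
                            System.out.println("connected; session is live");
                            break;
                        case Disconnected:
                            // Heartbeats stopped; the session survives only if
                            // the client reconnects within the timeout.
                            System.out.println("disconnected");
                            break;
                        case Expired:
                            // The server gave up: the session is dead and its
                            // ephemeral nodes (e.g. a tserver lock) are deleted.
                            System.out.println("session expired; lock is gone");
                            break;
                        default:
                            System.out.println("state: " + event.getState());
                    }
                }
            });
            System.out.println("client state: " + zk.getState());
            // Keep the JVM alive to observe state changes; a pause longer than
            // the session timeout (swap-in, long GC) ends in Expired.
            Thread.sleep(Long.MAX_VALUE);
        }
    }

This is why a swapped-out tserver "loses its lock" even though the process never crashed: the pause only has to outlast the session timeout once.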

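And a small diagnostic sketch for the swappiness question above, assuming a Linux host where the kernel exposes the setting at /proc/sys/vm/swappiness (standard Linux behavior, not Accumulo-specific):

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SwappinessCheck {
        public static void main(String[] args) throws Exception {
            // vm.swappiness (0-100) is the kernel's willingness to swap out
            // application pages; Accumulo's README recommends setting it to 0.
            String raw = Files.readAllLines(
                Paths.get("/proc/sys/vm/swappiness")).get(0).trim();
            int swappiness = Integer.parseInt(raw);
            if (swappiness == 0) {
                System.out.println("vm.swappiness = 0 (as recommended)");
            } else {
                System.out.println("vm.swappiness = " + swappiness
                    + "; lower it to 0 (e.g. sysctl -w vm.swappiness=0)"
                    + " so tservers stay out of swap");
            }
        }
    }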