Yes, system swappiness is set to 0. I'll run again and gather more logs. Is there a zookeeper timeout setting that I can adjust to avoid this issue, and is that advisable? Basically, the tservers are colocated with HDFS datanodes and Hadoop nodemanagers, and the machines are overcommitted on RAM. So I have a feeling that when a map-reduce job is kicked off, it pushes the tserver out to swap space. Once the map-reduce job finishes and the bulk ingest is kicked off, the tserver is paged back in, and the ZK session timeout in the meantime causes the shutdown.
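For context, the session timeout Accumulo requests from ZooKeeper is controlled by the instance.zookeeper.timeout property in accumulo-site.xml (default 30s in 1.5). Raising it buys headroom against pauses, at the cost of slower detection of genuinely dead tservers. A minimal sketch; the 60s value is illustrative, and ZooKeeper's server-side maxSessionTimeout (in zoo.cfg, default 20 x tickTime) must be at least this large for the higher value to actually be negotiated:

    <property>
      <name>instance.zookeeper.timeout</name>
      <value>60s</value>
    </property>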
On Mon, Jan 13, 2014 at 9:19 AM, Eric Newton <[email protected]> wrote:

> We would need to see a little bit more of the logs prior to the error.
> The tablet server is losing its connection to zookeeper.
>
> I have seen problems like this when a tablet server has been pushed into
> swap. When the server is tasked to do work, it begins to use the
> swapped-out memory, and the process is paused while the pages are swapped
> back in.
>
> The pauses prevent the zookeeper client API from sending keep-alive
> messages to zookeeper, so zookeeper thinks the process has died, and the
> tablet server loses its lock.
>
> Have you changed your system's swappiness to zero as outlined in the
> README?
>
> Check the debug lines containing "gc" and verify the server has plenty of
> free space.
>
> -Eric
>
>
> On Mon, Jan 13, 2014 at 8:11 AM, Anthony F <[email protected]> wrote:
>
>> I am experiencing an issue where one or more tservers are lost when bulk
>> importing the results of a mapreduce job. After the job is finished and
>> the bulk import is kicked off, I observe the following in the lost
>> tserver's logs:
>>
>> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
>> unexpected zookeeper event: None for
>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/recovery
>> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
>> unexpected zookeeper event: None for
>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>> 2014-01-10 23:14:21,369 [zookeeper.DistributedWorkQueue] ERROR: Failed to
>> look for work
>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>> KeeperErrorCode = ConnectionLoss for
>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>>
>> However, the bulk import actually succeeded and all is well with the data
>> in the table. I have to restart the tserver each time this happens, which
>> is not a viable solution for production.
>>
>> I am using Accumulo 1.5.0. Tservers have 12G of RAM, and index caching,
>> CF bloom filters, and locality groups are turned on for the table in
>> question. Any ideas why this might be happening?
>>
>> Thanks,
>> Anthony
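For reference, the swappiness change Eric mentions is applied through sysctl. A minimal sketch, assuming a typical Linux install (the /etc/sysctl.conf path may differ by distribution):

    # check the current value
    cat /proc/sys/vm/swappiness
    # apply immediately to the running kernel
    sudo sysctl -w vm.swappiness=0
    # persist across reboots
    echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf

Note that vm.swappiness=0 discourages swapping but does not disable it; under real memory pressure the kernel can still swap, so on overcommitted machines the colocated MapReduce tasks and the tserver heap still need to fit in physical RAM together.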
