You can alter instance.zookeeper.timeout in Accumulo.

It defaults to 30 seconds. You can override it by setting the property in accumulo-site.xml or by running `config -s instance.zookeeper.timeout=60s` in the Accumulo shell.
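
For example, the corresponding accumulo-site.xml entry would look roughly like the following (the 60s value is only an illustration; pick a timeout that suits your cluster):

    <property>
      <name>instance.zookeeper.timeout</name>
      <value>60s</value>
    </property>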

Beware that this will potentially make your system less responsive to these failures (the amount of time it takes Accumulo to notice a failure, reassign tablets, and recover will grow with the new timeout).

As far as logs go, you should see something near the end of the tserver*.debug.log file that tells you the tablet server lost its lock. You shouldn't have to dig very hard if this is the case.
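
As a quick sketch, something like the following should surface it (the exact wording of the log message may vary between versions, so treat the pattern as an assumption):

    grep -i lock tserver*.debug.log | tail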

On 1/13/14, 1:02 PM, Anthony F wrote:
Yes, system swappiness is set to 0.  I'll run again and gather more logs.

Is there a zookeeper timeout setting that I can adjust to avoid this
issue, and is that advisable?  Basically, the tservers are colocated with
HDFS datanodes and Hadoop nodemanagers.  The machines are overallocated
in terms of RAM.  So, I have a feeling that when a map-reduce job is
kicked off, it causes the tserver to page out to swap space.  Once the
map-reduce job finishes and the bulk ingest is kicked off, the tserver
is paged back in and the ZK timeout causes a shutdown.


On Mon, Jan 13, 2014 at 9:19 AM, Eric Newton <[email protected]> wrote:

    We would need to see a little bit more of the logs prior to the
    error.  The tablet server is losing its connection to zookeeper.

    I have seen problems like this when a tablet server has been pushed
    into swap.  When the server is tasked to do work, it begins to use
    the swapped out memory, and the process is paused while the pages
    are swapped back in.

    The pauses prevent the zookeeper client API from sending keep-alive
    messages to zookeeper, so zookeeper thinks the process has died, and
    the tablet server loses its lock.

    Have you changed your system's swappiness to zero as outlined in the
    README?
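
    As a rough sketch of how that is typically done (assuming a Linux
    system managed with sysctl, not quoting the README verbatim):

        sysctl -w vm.swappiness=0                     # apply immediately
        echo "vm.swappiness = 0" >> /etc/sysctl.conf  # persist across reboots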

    Check the debug lines containing "gc" and verify the server has
    plenty of free space.
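
    A hedged way to pull those lines, assuming the standard tserver
    debug log location:

        grep gc tserver*.debug.log | tail -20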

    -Eric


    On Mon, Jan 13, 2014 at 8:11 AM, Anthony F <[email protected]> wrote:

        I am losing one or more tservers when bulk importing the results
        of a mapreduce job.  After the job is finished and the bulk
        import is kicked off, I observe the following in the lost
        tserver's logs:

        2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO :
        Got unexpected zookeeper event: None for
        /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/recovery
        2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO :
        Got unexpected zookeeper event: None for
        /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
        2014-01-10 23:14:21,369 [zookeeper.DistributedWorkQueue] ERROR:
        Failed to look for work
        org.apache.zookeeper.KeeperException$ConnectionLossException:
        KeeperErrorCode = ConnectionLoss for
        /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq

        However, the bulk import actually succeeded and all is well with
        the data in the table.  I have to restart the tserver each time
        this happens, which is not a viable solution for production.

        I am using Accumulo 1.5.0.  Tservers have 12G of RAM, and index
        caching, CF bloom filters, and groups are turned on for the
        table in question.  Any ideas why this might be happening?

        Thanks,
        Anthony


