Yes, system swappiness is set to 0. I'll run again and gather more logs. Is there a zookeeper timeout setting that I can adjust to avoid this issue, and is that advisable? Basically, the tservers are colocated with HDFS datanodes and Hadoop nodemanagers, and the machines are overcommitted on RAM. So I have a feeling that when a map-reduce job is kicked off, it pushes the tserver out to swap space. Once the map-reduce job finishes and the bulk ingest is kicked off, the tserver is paged back in, and the ZK session timeout in the meantime causes the shutdown.
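For context, the session timeout Accumulo requests from ZooKeeper is controlled by the instance.zookeeper.timeout property in accumulo-site.xml (default 30s in 1.5). Raising it buys headroom against pauses, at the cost of slower detection of genuinely dead tservers. A minimal sketch; the 60s value is illustrative, and ZooKeeper's server-side maxSessionTimeout (in zoo.cfg, default 20 x tickTime) must be at least this large for the higher value to actually be negotiated:

    <property>
      <name>instance.zookeeper.timeout</name>
      <value>60s</value>
    </property>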
On Mon, Jan 13, 2014 at 9:19 AM, Eric Newton <[email protected]> wrote:

> We would need to see a little bit more of the logs prior to the error.
> The tablet server is losing its connection to zookeeper.
>
> I have seen problems like this when a tablet server has been pushed into
> swap. When the server is tasked to do work, it begins to use the
> swapped-out memory, and the process is paused while the pages are swapped
> back in.
>
> The pauses prevent the zookeeper client API from sending keep-alive
> messages to zookeeper, so zookeeper thinks the process has died, and the
> tablet server loses its lock.
>
> Have you changed your system's swappiness to zero as outlined in the
> README?
>
> Check the debug lines containing "gc" and verify the server has plenty of
> free space.
>
> -Eric
>
>
> On Mon, Jan 13, 2014 at 8:11 AM, Anthony F <[email protected]> wrote:
>
>> I am experiencing an issue where one or more tservers are lost when bulk
>> importing the results of a mapreduce job. After the job is finished and
>> the bulk import is kicked off, I observe the following in the lost
>> tserver's logs:
>>
>> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
>> unexpected zookeeper event: None for
>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/recovery
>> 2014-01-10 23:14:21,312 [zookeeper.DistributedWorkQueue] INFO : Got
>> unexpected zookeeper event: None for
>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>> 2014-01-10 23:14:21,369 [zookeeper.DistributedWorkQueue] ERROR: Failed to
>> look for work
>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>> KeeperErrorCode = ConnectionLoss for
>> /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/bulk_failed_copyq
>>
>> However, the bulk import actually succeeded and all is well with the data
>> in the table. I have to restart the tserver each time this happens, which
>> is not a viable solution for production.
>>
>> I am using Accumulo 1.5.0. Tservers have 12G of RAM, and index caching,
>> CF bloom filters, and locality groups are turned on for the table in
>> question. Any ideas why this might be happening?
>>
>> Thanks,
>> Anthony
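For reference, the swappiness change Eric mentions is applied through sysctl. A minimal sketch, assuming a typical Linux install (the /etc/sysctl.conf path may differ by distribution):

    # check the current value
    cat /proc/sys/vm/swappiness
    # apply immediately to the running kernel
    sudo sysctl -w vm.swappiness=0
    # persist across reboots
    echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf

Note that vm.swappiness=0 discourages swapping but does not disable it; under real memory pressure the kernel can still swap, so on overcommitted machines the colocated MapReduce tasks and the tserver heap still need to fit in physical RAM together.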
