Losing tservers - Unusually high Last Contact times

thomasa Mon, 19 May 2014 13:22:25 -0700

Hello all,

I am having issues with tablet servers going down, apparently due to high
Last Contact times (my hypothesis at least). In the past, smaller clouds
(20-40 nodes) have been stable, but I have run into issues with a larger
number of nodes (150+). Each node runs a datanode, nodemanager, and tablet
server. A single master node runs the Hadoop namenode, Hadoop resource
manager, and the Accumulo master, monitor, etc. There are three ZooKeeper
nodes. All nodes are VMs. The same setup is used on the smaller, stable
clouds as well.


I do not believe memory allocation is an issue, as I have given
Hadoop/YARN (2.2.0) and Accumulo (1.5.1) less than half of the available
memory. The FATAL errors I have seen are:

Lost tablet server lock (reason = SESSION_EXPIRED), exiting

Lost ability to monitor tablet server lock, exiting

Other than bumping up the RPC timeout (which I have done, though I would
rather find the root cause of the problem), I have run out of ideas on how
to solve this issue.

Does anyone have any insight into why I would be seeing such bad response
times between nodes? Are there any configuration parameters I can play with
to fix this?

I realize this is a very general question, so let me know if there is any
information I can provide to help clarify the issue.

Thank you in advance for your time.

Thomas



--
View this message in context: 
http://apache-accumulo.1065345.n5.nabble.com/Losing-tservers-Unusually-high-Last-Contact-times-tp9950.html
Sent from the Users mailing list archive at Nabble.com.
