Hello all,

I am having issues with tablet servers going down, and my hypothesis is that poor contact times are the cause. In the past I have had stable clouds at smaller sizes (20-40 nodes), but I have run into issues with a larger number of nodes (150+). Each node is a DataNode, NodeManager, and tablet server. There is one master node running the Hadoop NameNode, the Hadoop ResourceManager, and the Accumulo master, monitor, etc. There are three ZooKeeper nodes. All nodes are VMs. The same setup is used on the smaller, stable clouds as well.
I do not believe memory allocation is the issue, as I have given Hadoop/YARN (2.2.0) and Accumulo (1.5.1) less than half of the available memory.

The FATAL errors I have seen are:

    Lost tablet server lock (reason = SESSION_EXPIRED), exiting
    Lost ability to monitor tablet server lock, exiting

I have already bumped up the RPC timeout, but I would rather not rely on that and would prefer to find the root cause of the problem. Beyond that, I have run out of ideas on how to solve this issue.

Does anyone have any insight into why I would be seeing such bad response times between nodes? Are there any configuration parameters I can adjust to fix this?

I realize this is a very general question, so let me know if there is any information I can provide to help clarify the issue.

Thank you in advance for your time.

Thomas
