On Mon, May 19, 2014 at 6:56 PM, <[email protected]> wrote:
> You are hitting the zookeeper timeout, default 30s I believe. You said you
> are not oversubscribed for memory, but what about CPU? Are you running YARN
> processes on the same nodes as the tablet servers? Is the tablet server
> being pushed into swap or starved of CPU?
> Also check on the zookeeper server nodes. Is Java GC pausing tservers or
> zookeeper servers?
>
> -----Original Message-----
> From: thomasa [mailto:[email protected]]
> Sent: Monday, May 19, 2014 4:22 PM
> To: [email protected]
> Subject: Losing tservers - Unusually high Last Contact times
>
> Hello all,
>
> I am having issues with tablet servers going down due to poor contact times
> (my hypothesis at least). In the past I have had stability success with
> smaller clouds (20-40 nodes), but have run into issues with a larger number
> of nodes (150+). Each node is a datanode, nodemanager, and tablet server.
> There is a master node that is running the hadoop namenode, hadoop resource
> manager and accumulo master, monitor, etc. There are three zookeeper nodes.
> All nodes are vms. This same setup is used on the smaller, stable clouds as
> well.
>
> I do not believe memory allocation is an issue as I have only given
> hadoop/yarn (2.2.0) and accumulo (1.5.1) less than half of the available
> memory. The FATAL errors I have seen are:
>
> Lost tablet server lock (reason = SESSION_EXPIRED), exiting
>
> Lost ability to monitor tablet server lock, exiting
>
> Other than bumping up rpc timeout (which I have done, but I would rather
> not do that and instead find the root cause of the problem), I have run out
> of ideas on how to solve this issue.
>
> Does anyone have any insight into why I would be seeing such bad response
> times between nodes? Are there any configuration parameters I can play with
> to fix this?
>
> I realize this is a very general question, so let me know if there is any
> information I can provide to help clarify the issue.
>
> Thank you in advance for your time.
>
> Thomas
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/Losing-tservers-Unusually-high-Last-Contact-times-tp9950.html
> Sent from the Users mailing list archive at Nabble.com.
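
On the GC question raised in the reply above, one way to check is to scan the tserver and zookeeper GC logs for long stop-the-world pauses. The following is only a rough sketch, not something from the thread: it assumes the JVMs were started with GC logging enabled (for example -XX:+PrintGCDetails) so that entries carry a "[Times: user=... sys=..., real=... secs]" suffix. The function name find_long_pauses and the 10 second default threshold are made up for illustration; pick a threshold relative to the 30s zookeeper timeout mentioned above.

```python
import re
import sys

# Matches the wall-clock part of a HotSpot "[Times: user=... sys=..., real=... secs]" entry.
# Assumption: the GC log uses this format; other collectors or JVM versions may differ.
REAL_TIME = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def find_long_pauses(lines, threshold_secs):
    """Return (line_number, pause_secs) for each GC entry at or above the threshold."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = REAL_TIME.search(line)
        if match is None:
            continue  # not a GC timing line
        pause = float(match.group(1))
        if pause >= threshold_secs:
            hits.append((number, pause))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python gc_pauses.py /path/to/gc.log [threshold_secs]
    threshold = float(sys.argv[2]) if len(sys.argv) > 2 else 10.0
    with open(sys.argv[1]) as handle:
        for number, pause in find_long_pauses(handle, threshold):
            print(f"line {number}: {pause:.2f}s pause")
```

If pauses approaching the session timeout show up on either the tablet servers or the zookeeper nodes, that would be consistent with the SESSION_EXPIRED errors. An empty result points back toward CPU starvation or swap as the more likely cause.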
