Thank you for the responses. The number of CPUs is something I have considered. The worker nodes only have 4 CPUs, and the YARN processes run on the same nodes as the tablet servers.
On another cloud with 8 CPUs per worker, we have been able to run 10 YARN processes with 2 GB of memory each. Even though that configuration thrashes the workers (I have seen OS loads over 20), the tablet servers stay up.

I was worried about how many connections would be open on the larger cloud, so I significantly reduced the number of YARN processes there. Side question: does each worker node hold a connection to every other node? If so, my guess is that a 150+ node cloud would have far more open connections than a 40-node cloud. For that reason, the larger cloud that is seeing the issues runs only 2 YARN processes with 2 GB of memory each. My thinking was that each YARN process needs a core, the tablet server needs a core, and the OS can probably use one as well.

Is there a more elegant way to see whether the tablet server is being pushed into swap or starved of CPU than just watching top during the YARN job? (There is a rough sketch of what I mean at the end of this mail.)

I did look into ZooKeeper load a little, but I would be surprised to see issues there: the ZooKeeper nodes on the big cloud (1 CPU, 8 GB RAM) have significantly more RAM than the ones on the smaller cloud (1 CPU, 1 GB RAM). I did raise the memory limit for the Accumulo GC process, as I was seeing issues there early on.

What is the best way to check the iowait times for the ZK transaction log?
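Going back to the swap/starvation question: rather than eyeballing top, I could poll /proc myself. Below is a minimal sketch of that idea; it assumes Linux, that /proc/<pid>/schedstat is enabled in the kernel (CONFIG_SCHEDSTATS), and that the tserver PID comes from pgrep or the pid file. Nonzero VmSwap or a climbing major-fault count would point at swap, and a large run-queue wait per interval would point at CPU starvation.

    #!/usr/bin/env python
    # Watch one process for swapping and CPU starvation via /proc.
    # Usage: watch_proc.py <tserver-pid>   (script name is hypothetical)
    import sys, time

    def vmswap_kb(pid):
        # "VmSwap: NNN kB" in /proc/<pid>/status on recent kernels:
        # how much of the process currently sits in swap.
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmSwap:'):
                    return int(line.split()[1])
        return 0  # older kernels do not expose VmSwap

    def majflt(pid):
        # Field 12 of /proc/<pid>/stat: major page faults, i.e. faults
        # that had to hit disk (pages being swapped back in).
        with open('/proc/%s/stat' % pid) as f:
            after_comm = f.read().rsplit(')', 1)[1].split()
        return int(after_comm[9])

    def cpu_wait_ns(pid):
        # Second field of /proc/<pid>/schedstat: nanoseconds spent
        # runnable but waiting for a CPU -- starvation, directly.
        with open('/proc/%s/schedstat' % pid) as f:
            return int(f.read().split()[1])

    pid = sys.argv[1]
    flt, wait = majflt(pid), cpu_wait_ns(pid)
    while True:
        time.sleep(5)
        new_flt, new_wait = majflt(pid), cpu_wait_ns(pid)
        print('swap=%dkB majflt=%d cpu_wait=%.2fs (last 5s)' % (
            vmswap_kb(pid), new_flt - flt, (new_wait - wait) / 1e9))
        flt, wait = new_flt, new_wait

If sysstat is installed, pidstat -r -p <pid> 5 reports the same major-fault rate without any scripting.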

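On that last question: would parsing /proc directly be reasonable, or is there a better way? Here is a minimal sketch of what I have in mind, assuming Linux, with 'sda' standing in for whichever device actually holds the ZK dataLogDir; iostat -x from sysstat (the await and %util columns) would show the same numbers. (ZooKeeper itself also warns in its log when an fsync of the transaction log is slow, which might be worth grepping for.)

    #!/usr/bin/env python
    # Print system-wide iowait and per-device busy time every 5 seconds.
    import time

    HZ = 100     # USER_HZ on typical Linux builds
    DEV = 'sda'  # placeholder: the device backing the ZK dataLogDir

    def iowait_jiffies():
        # Aggregate "cpu" line of /proc/stat:
        # cpu user nice system idle iowait irq softirq ...
        with open('/proc/stat') as f:
            return int(f.readline().split()[5])

    def io_ms(dev):
        # /proc/diskstats: the 13th field is total milliseconds the
        # device has spent with I/O in flight.
        with open('/proc/diskstats') as f:
            for line in f:
                parts = line.split()
                if parts[2] == dev:
                    return int(parts[12])
        raise ValueError('no such device: %s' % dev)

    wait, busy = iowait_jiffies(), io_ms(DEV)
    while True:
        time.sleep(5)
        new_wait, new_busy = iowait_jiffies(), io_ms(DEV)
        # iowait is summed over all CPUs, so it can exceed 5s on a
        # multicore box; util is the share of the 5s window the
        # device was busy.
        print('iowait=%.2fs %s util=%.1f%%' % (
            (new_wait - wait) / float(HZ), DEV, (new_busy - busy) / 50.0))
        wait, busy = new_wait, new_busy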