Thank you for the responses. The number of CPUs is something I have considered. The worker nodes only have 4 CPUs, and the YARN processes run on the same nodes as the tablet servers.
On another cloud with 8 CPUs per worker, we have been able to run 10 YARN processes with 2 GB of memory each. Even though that configuration thrashes the workers (I have seen OS loads over 20), the tablet servers stay up.

I was worried about how many connections would be open on the larger cloud, so I significantly reduced the number of YARN processes there. Side question: does each worker node hold a connection to every other node? If so, my guess is that a 150+ node cloud would have far more open connections than a 40-node cloud. For that reason, the larger cloud that is seeing the issues runs only 2 YARN processes with 2 GB of memory each. My thinking was that each YARN process needs a core, the tablet server needs a core, and the OS can probably use one as well.

Is there a more elegant way to see whether the tablet server is being pushed into swap or starved of CPU than just watching top during the YARN job? (There is a rough sketch of what I mean at the end of this mail.)

I did look into ZooKeeper load a little, but I would be surprised to see issues there: the ZooKeeper nodes on the big cloud (1 CPU, 8 GB RAM) have significantly more RAM than the ones on the smaller cloud (1 CPU, 1 GB RAM). I did raise the memory limit for the Accumulo GC process, as I was seeing issues there early on.

What is the best way to check the iowait times for the ZK transaction log?
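Going back to the swap/starvation question: rather than eyeballing top, I could poll /proc myself. Below is a minimal sketch of that idea; it assumes Linux, that /proc/<pid>/schedstat is enabled in the kernel (CONFIG_SCHEDSTATS), and that the tserver PID comes from pgrep or the pid file. Nonzero VmSwap or a climbing major-fault count would point at swap, and a large run-queue wait per interval would point at CPU starvation.

    #!/usr/bin/env python
    # Watch one process for swapping and CPU starvation via /proc.
    # Usage: watch_proc.py <tserver-pid>   (script name is hypothetical)
    import sys, time

    def vmswap_kb(pid):
        # "VmSwap: NNN kB" in /proc/<pid>/status on recent kernels:
        # how much of the process currently sits in swap.
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmSwap:'):
                    return int(line.split()[1])
        return 0  # older kernels do not expose VmSwap

    def majflt(pid):
        # Field 12 of /proc/<pid>/stat: major page faults, i.e. faults
        # that had to hit disk (pages being swapped back in).
        with open('/proc/%s/stat' % pid) as f:
            after_comm = f.read().rsplit(')', 1)[1].split()
        return int(after_comm[9])

    def cpu_wait_ns(pid):
        # Second field of /proc/<pid>/schedstat: nanoseconds spent
        # runnable but waiting for a CPU -- starvation, directly.
        with open('/proc/%s/schedstat' % pid) as f:
            return int(f.read().split()[1])

    pid = sys.argv[1]
    flt, wait = majflt(pid), cpu_wait_ns(pid)
    while True:
        time.sleep(5)
        new_flt, new_wait = majflt(pid), cpu_wait_ns(pid)
        print('swap=%dkB majflt=%d cpu_wait=%.2fs (last 5s)' % (
            vmswap_kb(pid), new_flt - flt, (new_wait - wait) / 1e9))
        flt, wait = new_flt, new_wait

If sysstat is installed, pidstat -r -p <pid> 5 reports the same major-fault rate without any scripting.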

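On that last question: would parsing /proc directly be reasonable, or is there a better way? Here is a minimal sketch of what I have in mind, assuming Linux, with 'sda' standing in for whichever device actually holds the ZK dataLogDir; iostat -x from sysstat (the await and %util columns) would show the same numbers. (ZooKeeper itself also warns in its log when an fsync of the transaction log is slow, which might be worth grepping for.)

    #!/usr/bin/env python
    # Print system-wide iowait and per-device busy time every 5 seconds.
    import time

    HZ = 100     # USER_HZ on typical Linux builds
    DEV = 'sda'  # placeholder: the device backing the ZK dataLogDir

    def iowait_jiffies():
        # Aggregate "cpu" line of /proc/stat:
        # cpu user nice system idle iowait irq softirq ...
        with open('/proc/stat') as f:
            return int(f.readline().split()[5])

    def io_ms(dev):
        # /proc/diskstats: the 13th field is total milliseconds the
        # device has spent with I/O in flight.
        with open('/proc/diskstats') as f:
            for line in f:
                parts = line.split()
                if parts[2] == dev:
                    return int(parts[12])
        raise ValueError('no such device: %s' % dev)

    wait, busy = iowait_jiffies(), io_ms(DEV)
    while True:
        time.sleep(5)
        new_wait, new_busy = iowait_jiffies(), io_ms(DEV)
        # iowait is summed over all CPUs, so it can exceed 5s on a
        # multicore box; util is the share of the 5s window the
        # device was busy.
        print('iowait=%.2fs %s util=%.1f%%' % (
            (new_wait - wait) / float(HZ), DEV, (new_busy - busy) / 50.0))
        wait, busy = new_wait, new_busy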