On 5/20/14, 10:21 AM, thomasa wrote:
I was worried about how many connections would be open on the larger cloud, so I significantly reduced the number of YARN processes. Side question: does each worker node have a connection with every other node?
Are you referring to the YARN processes or Accumulo processes? For YARN, I believe the container will primarily be communicating back to the RM for MapReduce, but a custom app could be doing anything.
For Accumulo, a tserver will mostly be communicating only with the master. I know this isn't entirely true, though. For example, tservers will communicate with other tservers as a part of bulk-importing.
If they did, my guess was that there would be significantly more open connections on a 150+ node cloud than a 40 node cloud. For that reason, I only have 2 YARN processes with 2gb memory each on the larger cloud that is seeing the issues. My thought was that each YARN process needs a core, the tablet server needs a core, and OS stuff could probably use a core.
Yes, you should most definitely be leaving headroom on a system for the operating system. A core and 1G of RAM is probably a good starting point, but YMMV.
To increase the ZooKeeper timeout, you can try the following, but it will have other implications, such as failure detection/recovery being slower:
In accumulo-site.xml: set instance.zookeeper.timeout equal to something like 45s or 60s (default is 30s as Dave mentioned earlier).
In zoo.cfg: set maxSessionTimeout to the same value, but in milliseconds, e.g. 45000 or 60000.
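Concretely, the two changes above would look something like this (the 60s value is just an illustration; pick whatever timeout fits your cluster):

```xml
<!-- accumulo-site.xml: raise Accumulo's ZooKeeper session timeout
     (60s is an illustrative value; default is 30s) -->
<property>
  <name>instance.zookeeper.timeout</name>
  <value>60s</value>
</property>
```

```
# zoo.cfg: allow the ZooKeeper server to grant sessions up to the
# same timeout, expressed in milliseconds
maxSessionTimeout=60000
```

Note that both sides need to agree: if zoo.cfg's maxSessionTimeout is lower than what Accumulo requests, ZooKeeper will negotiate the session down to its own maximum.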
