I'm responsible for a topology of 12 EC2 instances, running a total of ~2500 executors across 81 workers. Recently we increased the number of executors, and the Zookeeper instance dedicated to this Storm cluster has started falling over because its small disk is exhausted by logs. This is, of course, tractable by increasing the disk space available to Zookeeper, but I'd like to see if we can find a cleaner solution. We're already cleaning logs hourly to the standard minimum of 3 snapshots, but it's not enough.
What are the adverse effects, if any, of increasing task.heartbeat.frequency.secs from the default value of 3? Based on my reading of the Storm source, increasing it should linearly reduce the rate of setData events to Zookeeper, and in turn the rate of accumulation of logs on disk. Are there timeouts we need to be careful of violating by reducing the frequency of heartbeats from executors?

-- Eric Allen, Software Engineer, AdRoll (www.adroll.com)
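For a rough sense of the numbers involved, here is a back-of-the-envelope sketch of the linear relationship described above. It assumes each heartbeat writer issues exactly one setData per interval and that heartbeats are written once per worker (81 here); both are assumptions rather than something confirmed from the Storm source, and `setdata_rate` is just an illustrative helper name. If heartbeats turn out to be written per executor, substitute ~2500 for the writer count; the scaling stays linear either way.

```python
def setdata_rate(writers: int, heartbeat_secs: float) -> float:
    """Estimated heartbeat setData calls per second.

    Assumes one setData per writer per heartbeat interval (an assumption,
    not verified against any particular Storm version).
    """
    if heartbeat_secs <= 0:
        raise ValueError("heartbeat_secs must be positive")
    return writers / heartbeat_secs


# Writer count taken from the cluster described above; assumes per-worker heartbeats.
WORKERS = 81

# Compare the default interval (3 s) with two candidate larger values.
for secs in (3, 6, 12):
    rate = setdata_rate(WORKERS, secs)
    print(f"{secs:>2}s interval: {rate:.2f} setData/s")
```

Under these assumptions, doubling the interval halves the write rate, so log growth attributable to heartbeats should fall by the same factor. Other Zookeeper traffic (supervisor heartbeats, assignments, and so on) is not modeled here.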
