Hi Ravi, Kyle, thanks for the input! I tried increasing the task timeout from 30 to 60 seconds and still observed the same issue. Increasing the timeout further does not look reasonable, since it would affect Nimbus's ability to detect real crashes.
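For reference, these are the knobs discussed in this thread as they would appear in storm.yaml (the values are the ones mentioned here, not recommendations):

```yaml
# Heartbeat timeout I experimented with (default is 30):
nimbus.task.timeout.secs: 60

# Ravi's suggestions - how often Nimbus checks for dead workers (default 10),
# and how long a supervisor waits before restarting a worker (default 30):
nimbus.monitor.freq.secs: 120
supervisor.worker.timeout.secs: 60
```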
I was looking at ZooKeeper metrics and haven't noticed any anomalies - no load spikes around the point of the heartbeat timeout. I will double-check, however. Kyle, could you elaborate a bit on what the ZooKeeper issue looked like in your case? Was it simply that a write call to ZooKeeper would at times block for more than nimbus.task.timeout.secs?

2015-12-16 21:53 GMT+03:00 Kyle Nusbaum <[email protected]>:

> Yes, I would check Zookeeper.
> We've seen the exact same thing in large clusters, which is what this was
> designed to help solve: https://issues.apache.org/jira/browse/STORM-885
>
> -- Kyle
>
> On Monday, December 14, 2015 8:45 PM, Ravi Tandon <
> [email protected]> wrote:
>
> Try the following:
>
> · Increase the value of "nimbus.monitor.freq.secs"="120"; this will make
> Nimbus wait longer before declaring a worker dead. Also check other
> configs like "supervisor.worker.timeout.secs" that will allow the system
> to wait longer before re-assigning/re-launching workers.
> · Check the write load on the Zookeepers too - that may be the bottleneck
> of your cluster and its coordination, rather than the worker nodes
> themselves. You can choose to add ZK nodes or provide better-spec
> machines for the quorum.
>
> -Ravi
>
> *From:* Yury Ruchin [mailto:[email protected]]
> *Sent:* Sunday, December 13, 2015 4:22 AM
> *To:* [email protected]
> *Subject:* Cascading "not alive" in topology with Storm 0.9.5
>
> Hello,
>
> I'm running a large topology using Storm 0.9.5. I have 2.5K executors
> distributed over 60 workers, 4-5 workers per node. The topology consumes
> data from a Kafka spout.
>
> I regularly observe Nimbus considering topology workers dead by heartbeat
> timeout. It then moves executors to other workers, but soon another
> worker times out. Nimbus moves its executors, and so on. The sequence
> repeats over and over - in fact, there are cascading worker timeouts in
> the topology which it cannot recover from. The topology itself looks
> alive but stops consuming from Kafka and, as a result, stops processing
> altogether.
>
> I didn't see any obvious issues with the network, so initially I assumed
> there might be worker process failures caused by exceptions/errors inside
> the process, e.g. OOME. Nothing appeared in the worker logs. I then found
> that the processes were actually alive when Nimbus declared them dead -
> it seems like they simply stopped sending heartbeats for some reason.
>
> I looked for Java fatal error logs on the assumption that the error might
> be caused by some nasty low-level things happening - but found nothing.
>
> I suspected high CPU usage, but it turned out that user CPU + system CPU
> on the nodes never went above 50-60% at peaks. The regular load was even
> less.
>
> I was observing the same issue with Storm 0.9.3, then upgraded to Storm
> 0.9.5 hoping that the fixes for
> https://issues.apache.org/jira/browse/STORM-329
> and https://issues.apache.org/jira/browse/STORM-404
> would help. But they haven't.
>
> Strangely enough, I can only reproduce the issue in this large setup.
> Small test setups with 2 workers do not expose the issue - even after
> killing all worker processes with kill -9, they recover seamlessly.
>
> My other guess is that the large number of workers causes significant
> overhead in establishing Netty connections during worker startup, which
> somehow prevents heartbeats from being sent. Maybe this is something
> similar to https://issues.apache.org/jira/browse/STORM-763
> and it's worth upgrading to 0.9.6 - I don't know how to check this.
>
> Any help is appreciated.
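P.S. For anyone following along: one lightweight way to spot-check ZooKeeper latency around the timeout window is the standard four-letter-word commands (srvr/mntr). A minimal sketch - the hostname is a placeholder, and the parsing helper is my own, not part of any Storm or ZooKeeper tooling:

```shell
#!/bin/sh
# Extract the "Latency min/avg/max: <min>/<avg>/<max>" line (milliseconds)
# from ZooKeeper's 'srvr' four-letter-word output read on stdin.
zk_latency() {
  awk -F': ' '/^Latency min\/avg\/max/ { print $2 }'
}

# Live usage against an ensemble member (placeholder hostname):
#   echo srvr | nc zk1.example.com 2181 | zk_latency
```

A max latency approaching nimbus.task.timeout.secs (converted to ms) would support the "blocked ZooKeeper writes" theory.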
