Hi Ravi, Kyle, thanks for the input!

I tried increasing the task timeout from 30 to 60 seconds and still
observed the same issue. Increasing the timeout further does not look
reasonable, since it would hurt Nimbus's ability to detect real crashes.
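
For reference, the change I made is the one-line storm.yaml override
sketched below, applied on the Nimbus node and followed by a Nimbus
restart:

    # storm.yaml on the Nimbus node
    nimbus.task.timeout.secs: 60   # default is 30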

I looked at the ZooKeeper metrics and haven't noticed any anomalies - no
load spikes around the time of the heartbeat timeouts. I will double-check,
however.

Kyle,

Could you elaborate a bit on what the issue with ZooKeeper looked like in
your case? Was it simply that write calls to ZooKeeper at times blocked for
more than nimbus.task.timeout.secs?
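
To check for that on my side, I'm thinking of timing writes directly with
a small probe along the lines of the sketch below (the connect string and
the 1-second threshold are placeholders, not values from our setup):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Repeatedly time small writes against the quorum and report any
    // write that blocks for longer than the chosen threshold.
    public class ZkWriteProbe {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, null);
            Thread.sleep(2000); // crude wait for the session to establish
            String path = zk.create("/zk-write-probe", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            while (true) {
                long startNanos = System.nanoTime();
                zk.setData(path, new byte[512], -1); // -1 = any version
                long millis = (System.nanoTime() - startNanos) / 1_000_000;
                if (millis > 1000) {
                    System.out.println("slow ZooKeeper write: " + millis + " ms");
                }
                Thread.sleep(1000);
            }
        }
    }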

2015-12-16 21:53 GMT+03:00 Kyle Nusbaum <[email protected]>:

> Yes, I would check ZooKeeper.
> We've seen the exact same thing in large clusters, which is what this was
> designed to help solve: https://issues.apache.org/jira/browse/STORM-885
>
> -- Kyle
>
>
>
> On Monday, December 14, 2015 8:45 PM, Ravi Tandon <
> [email protected]> wrote:
>
>
> Try the following:
>
> - Increase "nimbus.monitor.freq.secs" (e.g. to 120); this will make
> Nimbus wait longer before declaring a worker dead. Also check other
> configs like "supervisor.worker.timeout.secs" that allow the system to
> wait longer before re-assigning/re-launching workers (see the storm.yaml
> sketch after this list).
> - Check the write load on the ZooKeeper quorum too; that may be the
> bottleneck of your cluster and its coordination, rather than the worker
> nodes themselves. You can choose to add ZooKeeper nodes or provide
> better-spec machines for the quorum.
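>
> In storm.yaml terms, these would look something like the sketch below
> (the values are illustrations to tune, not recommendations):
>
>     nimbus.monitor.freq.secs: 120       # how often Nimbus checks heartbeats (default 10)
>     supervisor.worker.timeout.secs: 60  # grace before a worker is restarted (default 30)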
>
> -Ravi
>
> *From:* Yury Ruchin [mailto:[email protected]]
> *Sent:* Sunday, December 13, 2015 4:22 AM
> *To:* [email protected]
> *Subject:* Cascading "not alive" in topology with Storm 0.9.5
>
> Hello,
>
> I'm running a large topology using Storm 0.9.5. I have 2.5K executors
> distributed over 60 workers, 4-5 workers per node. The topology consumes
> data from a Kafka spout.
>
> I regularly observe Nimbus considering topology workers dead by heartbeat
> timeout. It then moves executors to other workers, but soon another worker
> times out. Nimbus moves its executors, and so on. The sequence repeats
> over and over - in effect, cascading worker timeouts from which the
> topology cannot recover. The topology itself looks alive but stops
> consuming from Kafka and, as a result, stops processing altogether.
>
> I didn't see any obvious issues with the network, so initially I assumed
> there might be worker process failures caused by exceptions/errors inside
> the process, e.g. an OOME. Nothing appeared in the worker logs. I then
> found that the processes were actually alive when Nimbus declared them
> dead - it seems they simply stopped sending heartbeats for some reason.
>
> I looked for Java fatal error logs, on the assumption that the error
> might be caused by some nasty low-level things happening - but found
> nothing.
>
> I suspected high CPU usage, but it turned out that user CPU + system CPU
> on the nodes never went above 50-60% at peak. The regular load was even
> lower.
>
> I was observing the same issue with Storm 0.9.3, then upgraded to Storm
> 0.9.5 hoping that the fixes for
> https://issues.apache.org/jira/browse/STORM-329
> and https://issues.apache.org/jira/browse/STORM-404
> would help, but they didn't.
>
> Strangely enough, I can only reproduce the issue in this large setup.
> Small test setups with 2 workers do not exhibit it - even after killing
> all worker processes with kill -9, they recover seamlessly.
>
> My other guess is that the large number of workers causes significant
> overhead in establishing Netty connections during worker startup, which
> somehow prevents heartbeats from being sent. Maybe this is something
> similar to
> https://issues.apache.org/jira/browse/STORM-763
> and it may be worth upgrading to 0.9.6 - I don't know how to check that.
>
> Any help is appreciated.
