Hi,

I'm not sure if this is the correct way to post my question; if not, please direct me to the right place to submit it.
We have been using Storm for quite some time. We upgraded from 0.9.3 to 0.9.5 to overcome an issue where workers crashed in a cascading manner. In 0.9.5 that issue was resolved, but we started facing a new one: if a worker dies, the topology is not able to recover from it, and the tuple execute count suddenly drops. From the logs it was clear that the worker kept trying to reconnect and was hitting a netty.client error, and we had to restart the topology manually to get it working again (after a restart everything ran fine for a few hours, then the topology clogged up again).

On further research we realized our issue was very close to https://github.com/apache/storm/pull/566, and we had errors like:

=====
[ERROR] [Thread-10-disruptor-worker-transfer-queue] b.s.m.n.Client dropping 1 message(s) destined for Netty-Client-ip-172-18-0-207.us-west-2.compute.internal/172.18.0.207:6702
2015-10-22T17:21:58.705+0000 [INFO] [client-schedule-service-10] b.s.m.n.Client connection established to Netty-Client-ip-172-18-0-207.us-west-2.compute.internal/172.18.0.207:6702
=====

To overcome the above problem, we upgraded to 0.10.0-beta1. The topology now looks better, but the problem is not completely solved. Sometimes when workers fail, the topology is able to recover within 10-15 minutes; other times we see the same behavior as before. Around the time the workers die, the logs usually say:

ERROR Policies has no parameter that matches element DefaultRolloverStrategy

and a corresponding commit was found on GitHub: https://github.com/apache/storm/pull/638/files

My questions are: why do we still see the same problem? The log4j policy error seems unrelated, but will the above patch solve it? Has anyone faced a similar situation, and what could be the cause?

Any help is greatly appreciated.

Thanks,
Dharin.
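P.S. In case it is relevant, below is a sketch of the Netty transport settings in storm.yaml that we are considering tuning around the reconnect behavior. The values shown are illustrative (roughly the shipped defaults as we understand them), not our verified production config or a confirmed fix; the right numbers likely depend on cluster size and load.

=====
# storm.yaml - Netty transport settings (illustrative values, not a verified fix)
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880    # 5 MB send/receive buffer
storm.messaging.netty.max_retries: 300        # reconnect attempts before the client gives up
storm.messaging.netty.min_wait_ms: 100        # backoff floor between reconnect attempts
storm.messaging.netty.max_wait_ms: 1000       # backoff ceiling between reconnect attempts
=====

If our understanding of the reconnect loop is correct, raising max_retries / max_wait_ms should give a restarting worker more time to come back before its peers give up and start dropping messages, which is the symptom we see in the logs above.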
