Hi,

I'm not sure if this is the correct way to post my question; if not, please direct me to the right place to submit it.
We have been using Storm for quite some time. We upgraded from 0.9.3 to 0.9.5 to overcome an issue where workers crashed in a cascading manner. In 0.9.5 that issue was resolved, but we started facing a new one: if a worker dies, the topology is not able to recover from it, and the tuple execute count suddenly drops. From the logs it was clear that the worker kept trying to reconnect and was hitting a netty.client error, and we had to restart the topology manually to get it working again (after a restart everything ran fine for a few hours, then the topology clogged up again).

On further research we realized our issue was very close to https://github.com/apache/storm/pull/566, and we had errors like:

=====
[ERROR] [Thread-10-disruptor-worker-transfer-queue] b.s.m.n.Client dropping 1 message(s) destined for Netty-Client-ip-172-18-0-207.us-west-2.compute.internal/172.18.0.207:6702
2015-10-22T17:21:58.705+0000 [INFO] [client-schedule-service-10] b.s.m.n.Client connection established to Netty-Client-ip-172-18-0-207.us-west-2.compute.internal/172.18.0.207:6702
=====

To overcome the above problem, we upgraded to 0.10.0-beta1. The topology now looks better, but the problem is not completely solved. Sometimes when workers fail, the topology is able to recover within 10-15 minutes; other times we see the same behavior as before. Around the time the workers die, the logs usually say:

ERROR Policies has no parameter that matches element DefaultRolloverStrategy

and a corresponding commit was found on GitHub: https://github.com/apache/storm/pull/638/files

My questions are: why do we still see the same problem? The log4j policy error seems unrelated, but will the above patch solve it? Has anyone faced a similar situation, and what could be the cause?

Any help is greatly appreciated.

Thanks,
Dharin.
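P.S. In case it is relevant, below is a sketch of the Netty transport settings in storm.yaml that we are considering tuning around the reconnect behavior. The values shown are illustrative (roughly the shipped defaults as we understand them), not our verified production config or a confirmed fix; the right numbers likely depend on cluster size and load.

=====
# storm.yaml - Netty transport settings (illustrative values, not a verified fix)
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880    # 5 MB send/receive buffer
storm.messaging.netty.max_retries: 300        # reconnect attempts before the client gives up
storm.messaging.netty.min_wait_ms: 100        # backoff floor between reconnect attempts
storm.messaging.netty.max_wait_ms: 1000       # backoff ceiling between reconnect attempts
=====

If our understanding of the reconnect loop is correct, raising max_retries / max_wait_ms should give a restarting worker more time to come back before its peers give up and start dropping messages, which is the symptom we see in the logs above.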
