Thanks Steve, Enno, Martin. The only thing the workers had in common was the GC logging that I configured; I don't find anything else. After I made the changes there, what I also see is that the spout stops consuming and no workers crash either. It just stops and nothing happens.
I think it has to do with the number of messages being sent into the system. If I keep the message level low (by adjusting max spout pending), the topology stays up: 90 minutes and counting. Otherwise, the system crashed within 15 minutes. What I was expecting was that the topology crashes and then restarts, but that is exactly what was not happening. I tried it on 0.10.0-beta1 too and found the same behavior. The last prod version I had was 0.9.0-wip16, where 0mq was used, and I did not find issues there.

Thanks
Kashyap

On Sep 13, 2015 15:39, "Stephen Powis" <[email protected]> wrote:

> Kashyap - I see this same issue on 0.9.5
>
> On Sun, Sep 13, 2015 at 9:58 AM, Enno Shioji <[email protected]> wrote:
>
>> There was a change in that area in 0.9.6
>> (https://issues.apache.org/jira/browse/STORM-763), although I'm not sure
>> if it will help your issue.
>>
>> On Sun, Sep 13, 2015 at 2:35 PM, Kashyap Mhaisekar <[email protected]>
>> wrote:
>>
>>> Hmm. Thanks for the lead. On the Storm UI, the uptime for each executor
>>> except the spout shows pretty much consistent values. The spout has
>>> crashed for sure, but then never comes up. Will check this again.
>>>
>>> But the other question is: is the Netty reconnects issue solved in
>>> 0.9.5? What is your Storm version?
>>>
>>> Thanks
>>> Kashyap
>>>
>>> On Sep 13, 2015 08:04, "Martin Burian" <[email protected]> wrote:
>>>
>>>> They do restart after a while, yes. But if you don't see any error in
>>>> the log, it's weird. I encountered a case of workers not starting
>>>> because I configured the worker JVM to expose a JMX interface for
>>>> remote monitoring on a given port. Other workers on the same machine
>>>> could not start, as they failed to bind to the already used port. No
>>>> error messages whatsoever. Might any such thing be your case?
>>>>
>>>> Otherwise the cause should be logged somewhere. A worker is definitely
>>>> not running, or at least not talking to the supervisor. You could try
>>>> using fewer workers to find out when/where the error occurs.
>>>>
>>>> Martin
>>>>
>>>> On Sun, Sep 13, 2015 at 13:43, Kashyap Mhaisekar <[email protected]>
>>>> wrote:
>>>>
>>>>> All worker logs have the same log. Workers are up. I am using only
>>>>> one box with multiple workers to test.
>>>>> Workers should be restarted if they fail, right? So ideally, this
>>>>> error should be gone in a while.
>>>>>
>>>>> Thanks
>>>>> Kashyap
>>>>>
>>>>> On Sep 13, 2015 05:10, "Martin Burian" <[email protected]> wrote:
>>>>>
>>>>>> When this appears in the worker log, it means that the worker is
>>>>>> trying to connect to another worker, but the other one is not
>>>>>> running. What do you see in worker-6707.log? Is the other worker
>>>>>> running?
>>>>>> Martin
>>>>>>
>>>>>> On Sun, Sep 13, 2015 at 6:06, Kashyap Mhaisekar <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Also,
>>>>>>> Is there a way to switch back to 0mq from Netty? If so, what needs
>>>>>>> to be done?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Kashyap
>>>>>>>
>>>>>>> On Sat, Sep 12, 2015 at 10:49 PM, Kashyap Mhaisekar
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am having Netty-related issues in my Storm cluster where the
>>>>>>>> spout stops consuming after a while. The corresponding worker
>>>>>>>> logs show -
>>>>>>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [ERROR] connection attempt 26 to Netty-Client-trsttel2pascapp01.vm.itg.corp.us.shldcorp.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established
>>>>>>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [INFO] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 scheduled to run in 392 ms
>>>>>>>> 2015-09-12T23:28:23.784-0400 b.s.m.n.Client [ERROR] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established
>>>>>>>>
>>>>>>>> The corresponding supervisor logs had
>>>>>>>> 2015-09-12T23:28:23.018-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>> 2015-09-12T23:28:23.518-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>> 2015-09-12T23:28:24.019-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>>
>>>>>>>> I had Storm version 0.9.3 when this issue occurred and upgraded
>>>>>>>> to 0.9.4 and 0.9.5 to seek relief, but the issue still persists.
>>>>>>>> I am not sure what else to do. I am not even sure why this issue
>>>>>>>> occurs or what triggers it. Any help would be great and
>>>>>>>> appreciated.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Kashyap
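
For anyone following the thread, the throttling mentioned at the top (keeping the message level low through max spout pending) can be sketched roughly as follows. This is only an illustration, not the configuration actually used in this cluster. `topology.max.spout.pending` is Storm's config key for the cap on un-acked tuples per spout task; the class name, the helper `buildConf`, and the value 500 are hypothetical. A plain `Map` is used so the snippet runs without Storm on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: builds a topology config map that throttles the spout.
class SpoutPendingSketch {
    // Real Storm config key. It only affects spouts that emit tuples
    // with a message id (reliable, acked tuples).
    static final String MAX_SPOUT_PENDING = "topology.max.spout.pending";

    // Returns a config map capping the un-acked tuples per spout task.
    static Map<String, Object> buildConf(int maxPending) {
        if (maxPending <= 0) {
            throw new IllegalArgumentException("maxPending must be positive");
        }
        Map<String, Object> conf = new HashMap<>();
        conf.put(MAX_SPOUT_PENDING, maxPending);
        return conf;
    }

    public static void main(String[] args) {
        // 500 is an arbitrary illustrative value, not a recommendation.
        Map<String, Object> conf = buildConf(500);
        System.out.println(conf);
    }
}
```

In a real topology the same entry would normally go into Storm's own `Config` object (which is itself a map) through `setMaxSpoutPending`, and that object is then passed along when the topology is submitted.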
