Hi Danijel,

Is the issue resolved in any version of Storm?
Regards,
Tarkeshwar

On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <[email protected]> wrote:

> I've filed a bug report for this under
> https://issues.apache.org/jira/browse/STORM-406
>
> The issue is 100% reproducible with, it seems, any Trident topology and
> across multiple Storm versions with the Netty transport enabled. 0MQ is
> working fine. You can try with TridentWordCount from storm-starter, for
> example.
>
> Your insight seems correct: when the killed worker re-spawns on the same
> slot (port), the topology stops processing. See the above JIRA for
> additional info.
>
> Danijel
>
> On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <[email protected]> wrote:
>
>> Thanks, Danijel, for helping me.
>>
>> On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>
>>> I see no issues with your cluster configuration.
>>>
>>> You should definitely share the (simplified, if possible) topology
>>> code and the steps to reproduce the blockage; better yet, file a
>>> JIRA task on Apache's JIRA web -- be sure to include your Trident
>>> internals modifications.
>>>
>>> Unfortunately, it seems I'm having the same issues with Storm 0.9.2
>>> too, so I might get back here with some updates soon. It's not as fast
>>> and easy to reproduce as it was under 0.9.1, but the bug seems
>>> nonetheless still present. I'll reduce the number of Storm slots and
>>> topology workers as per your insights; hopefully this will make it
>>> easier to reproduce the bug with a simplified Trident topology.
>>>
>>> On Tuesday, July 15, 2014, M.Tarkeshwar Rao <[email protected]> wrote:
>>>
>>>> Hi Danijel,
>>>>
>>>> We have made a few changes to the Trident core framework code as per
>>>> our needs, and they work fine with ZeroMQ. I am sharing the
>>>> configuration we are using. Can you please check whether our config
>>>> is fine or not?
>>>>
>>>> The code is large, so we are writing a sample topology to reproduce
>>>> the issue, which we will share with you.
>>>>
>>>> Steps to reproduce the issue:
>>>> -------------------------------------------------------------
>>>>
>>>> 1. We deployed our topology on one Linux machine, with two workers
>>>>    and one acker, and batch size 2.
>>>> 2. Both workers come up and start processing.
>>>> 3. After a few seconds I killed one of the workers with kill -9.
>>>> 4. When the killed worker re-spawns on the same port, it hangs.
>>>> 5. Only retries keep happening.
>>>> 6. When the killed worker re-spawns on another port, everything
>>>>    works fine.
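A minimal sample topology for reproducing this -- essentially storm-starter's TridentWordCount submitted with two workers and one acker, matching the steps above -- might look like the sketch below. The class and topology names are illustrative, not taken from this thread.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class NettyRespawnRepro {
        // Simple split function, as in storm-starter's TridentWordCount.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Cycling in-memory spout so batches keep flowing; max batch size 2
            // to mirror the batch size used in the reproduction steps.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 2,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"));

            // Two workers and one acker, as in the steps above.
            Config conf = new Config();
            conf.setNumWorkers(2);
            conf.setNumAckers(1);
            StormSubmitter.submitTopology("netty-respawn-repro", conf, topology.build());
        }
    }

After submitting it, kill -9 one of the worker JVMs and check whether processing resumes once the supervisor restarts the worker on the same slot.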
>>>>
>>>> Machine conf:
>>>> --------------------------
>>>> [root@sb6270x1637-2 conf]# uname -a
>>>> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43
>>>> EST 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> *storm.yaml* which we are using to launch Nimbus, the supervisor and the UI:
>>>>
>>>> ########## These MUST be filled in for a storm configuration
>>>> storm.zookeeper.servers:
>>>>     - "10.61.244.86"
>>>> storm.zookeeper.port: 2000
>>>> supervisor.slots.ports:
>>>>     - 6788
>>>>     - 6789
>>>>     - 6800
>>>>     - 6801
>>>>     - 6802
>>>>     - 6803
>>>>
>>>> nimbus.host: "10.61.244.86"
>>>>
>>>> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>> storm.messaging.netty.server_worker_threads: 10
>>>> storm.messaging.netty.client_worker_threads: 10
>>>> storm.messaging.netty.buffer_size: 5242880
>>>> storm.messaging.netty.max_retries: 100
>>>> storm.messaging.netty.max_wait_ms: 1000
>>>> storm.messaging.netty.min_wait_ms: 100
>>>>
>>>> storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>>>> storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>>>> topology.acker.executors: 1
>>>> topology.message.timeout.secs: 30
>>>> supervisor.scheduler.meta:
>>>>     name: "supervisor1"
>>>>
>>>> worker.childopts: "-Xmx2048m"
>>>>
>>>> mm.hdfs.ipaddress: "10.61.244.7"
>>>> mm.hdfs.port: 9000
>>>> topology.batch.size: 2
>>>> topology.batch.timeout: 10000
>>>> topology.workers: 2
>>>> topology.debug: true
>>>>
>>>> Regards,
>>>> Tarkeshwar
>>>>
>>>> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>
>>>>> Hi Tarkeshwar,
>>>>>
>>>>> Could you provide a code sample of your topology? Do you have any
>>>>> special configs enabled?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Danijel
>>>>>
>>>>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <[email protected]> wrote:
>>>>>
>>>>>> Hi Danijel,
>>>>>>
>>>>>> We are able to reproduce this issue with 0.9.2 as well.
>>>>>> We have a two-worker setup running the Trident topology.
>>>>>>
>>>>>> When we kill one of the workers and the killed worker then spawns on
>>>>>> the same port (same slot), that worker is not able to communicate
>>>>>> with the second worker.
>>>>>>
>>>>>> Only the transaction attempts keep increasing continuously.
>>>>>>
>>>>>> But if the killed worker spawns on a new slot (new communication
>>>>>> port), it works fine. Same behavior as in Storm 0.9.1.
>>>>>>
>>>>>> Please update me if there are any new developments.
>>>>>>
>>>>>> Regards,
>>>>>> Tarkeshwar
>>>>>>
>>>>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Bobby,
>>>>>>>
>>>>>>> Just an update on the stuck Trident transactional topology issue --
>>>>>>> I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and
>>>>>>> can't reproduce the bug anymore. Will keep you posted if any issues
>>>>>>> arise.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Danijel
>>>>>>>
>>>>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have not seen this before; if you could file a JIRA on this,
>>>>>>>> that would be great.
>>>>>>>>
>>>>>>>> - Bobby
>>>>>>>>
>>>>>>>> From: Danijel Schiavuzzi <[email protected]>
>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>>>>> To: "[email protected]" <[email protected]>,
>>>>>>>> "[email protected]" <[email protected]>
>>>>>>>> Subject: Trident transactional topology stuck re-emitting batches
>>>>>>>> with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I've managed to reproduce the stuck topology problem, and it seems
>>>>>>>> it's due to the Netty transport. I'm running with the ZMQ transport
>>>>>>>> enabled now and haven't been able to reproduce it.
>>>>>>>>
>>>>>>>> The problem is basically a Trident/Kafka transactional topology
>>>>>>>> getting stuck, i.e. re-emitting the same batches over and over
>>>>>>>> again. This happens after the Storm workers restart a few times due
>>>>>>>> to the Kafka spout throwing RuntimeExceptions (because the Kafka
>>>>>>>> consumer in the spout times out with a SocketTimeoutException due
>>>>>>>> to some temporary network problems). Sometimes the topology is
>>>>>>>> stuck after just one worker restart, and sometimes a few worker
>>>>>>>> restarts are needed to trigger the problem.
>>>>>>>>
>>>>>>>> I simulated the Kafka spout socket timeouts by blocking network
>>>>>>>> access from Storm to my Kafka machines (with an iptables firewall
>>>>>>>> rule). Most of the time the spouts (workers) would restart normally
>>>>>>>> (after re-enabling access to Kafka) and the topology would continue
>>>>>>>> to process batches, but sometimes the topology would get stuck
>>>>>>>> re-emitting batches after the crashed workers restarted. Killing
>>>>>>>> and re-submitting the topology manually always fixes this, and
>>>>>>>> processing continues normally.
>>>>>>>>
>>>>>>>> I haven't been able to reproduce this scenario after reverting my
>>>>>>>> Storm cluster's transport to ZeroMQ. With the Netty transport, I
>>>>>>>> can almost always reproduce the problem by causing a worker to
>>>>>>>> restart a number of times (only about 4-5 worker restarts are
>>>>>>>> enough to trigger this).
>>>>>>>>
>>>>>>>> Any hints on this? Has anyone had the same problem? It does seem a
>>>>>>>> serious issue, as it affects the reliability and fault tolerance of
>>>>>>>> the Storm cluster.
>>>>>>>>
>>>>>>>> In the meantime, I'll try to prepare a reproducible test case for
>>>>>>>> this.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Danijel
>>>>>>>>
>>>>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> To (partially) answer my own question -- I still have no idea
>>>>>>>>> about the cause of the stuck topology, but re-submitting the
>>>>>>>>> topology helps -- after re-submitting, my topology is now running
>>>>>>>>> normally.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Also, I did have multiple cases of my IBackingMap workers dying
>>>>>>>>>> (because of RuntimeExceptions) but successfully restarting
>>>>>>>>>> afterwards (I throw RuntimeExceptions in the BackingMap
>>>>>>>>>> implementation as my strategy in rare SQL database deadlock
>>>>>>>>>> situations, to force a worker restart and to fail and retry the
>>>>>>>>>> batch).
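That fail-fast strategy could look roughly like the sketch below: a hypothetical IBackingMap whose multiPut() rethrows SQL errors (such as deadlocks) as RuntimeExceptions, so the worker dies and Trident re-emits the batch once it comes back. The SqlValueStore interface and the class names are illustrative stand-ins, not the actual code discussed in this thread.

    import java.sql.SQLException;
    import java.util.List;

    import storm.trident.state.map.IBackingMap;

    public class FailFastSqlBackingMap<T> implements IBackingMap<T> {

        // Illustrative persistence interface; stands in for whatever DAO the
        // real topology uses to read and write state in the SQL database.
        public interface SqlValueStore<T> {
            List<T> read(List<List<Object>> keys) throws SQLException;
            void write(List<List<Object>> keys, List<T> vals) throws SQLException;
        }

        private final SqlValueStore<T> store;

        public FailFastSqlBackingMap(SqlValueStore<T> store) {
            this.store = store;
        }

        @Override
        public List<T> multiGet(List<List<Object>> keys) {
            try {
                return store.read(keys);
            } catch (SQLException e) {
                throw new RuntimeException("multiGet failed, failing the batch", e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<T> vals) {
            try {
                store.write(keys, vals);
            } catch (SQLException e) {
                // A deadlock (or any other SQL error) crashes the worker; after
                // the worker restarts, Trident re-emits and retries the batch.
                throw new RuntimeException("multiPut failed, failing the batch", e);
            }
        }
    }

In a real topology this map would typically be wrapped in Trident's TransactionalMap or OpaqueMap by the state factory passed to persistentAggregate().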
>>>>>>>>>>
>>>>>>>>>> From the logs, one such IBackingMap worker death (and subsequent
>>>>>>>>>> restart) resulted in the Kafka spout re-emitting the pending
>>>>>>>>>> tuple:
>>>>>>>>>>
>>>>>>>>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO]
>>>>>>>>>> re-emitting batch, attempt 29698959:736
>>>>>>>>>>
>>>>>>>>>> This is of course the normal behavior of a transactional
>>>>>>>>>> topology, but this is the first time I've encountered a case of
>>>>>>>>>> a batch retrying indefinitely. This is especially suspicious
>>>>>>>>>> since the topology has been running fine for 20 days straight,
>>>>>>>>>> re-emitting batches and restarting IBackingMap workers quite a
>>>>>>>>>> number of times.
>>>>>>>>>>
>>>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>>>> with the exact txid value 29698959 has been committed -- but I
>>>>>>>>>> suspect that could come from another BackingMap, since there are
>>>>>>>>>> two BackingMap instances running (parallelismHint 2).
>>>>>>>>>>
>>>>>>>>>> However, I have no idea why the batch is being retried
>>>>>>>>>> indefinitely now, nor why it hasn't been successfully acked by
>>>>>>>>>> Trident.
>>>>>>>>>>
>>>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>>>> research on?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I'm having problems with my transactional Trident topology. It
>>>>>>>>>>> has been running fine for about 20 days, and suddenly it is
>>>>>>>>>>> stuck processing a single batch, with no tuples being emitted
>>>>>>>>>>> nor persisted by the TridentState (IBackingMap).
>>>>>>>>>>>
>>>>>>>>>>> It's a simple topology which consumes messages off a Kafka
>>>>>>>>>>> queue. The spout is an instance of the storm-kafka-0.8-plus
>>>>>>>>>>> TransactionalTridentKafkaSpout, and I use the trident-mssql
>>>>>>>>>>> transactional TridentState implementation to
>>>>>>>>>>> persistentAggregate() data into a SQL database.
>>>>>>>>>>>
>>>>>>>>>>> In Zookeeper I can see that Storm is re-trying a batch, i.e.
>>>>>>>>>>>
>>>>>>>>>>> "/transactional/<myTopologyName>/coordinator/currattempts"
>>>>>>>>>>> is "{"29698959":6487}"
>>>>>>>>>>>
>>>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper
>>>>>>>>>>> keeps increasing -- it seems the batch isn't being acked by
>>>>>>>>>>> Trident, and I have no idea why, especially since the topology
>>>>>>>>>>> has been running successfully for the last 20 days.
>>>>>>>>>>>
>>>>>>>>>>> I did rebalance the topology on one occasion, after which it
>>>>>>>>>>> continued running normally. Other than that, no other
>>>>>>>>>>> modifications were done. Storm is at version 0.9.0.1.
>>>>>>>>>>>
>>>>>>>>>>> Any hints on how to debug the stuck topology? Any other useful
>>>>>>>>>>> info I might provide?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>
>>>>>>>>>>> E: [email protected]
>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>> T: +385989035562
>>>>>>>>>>> Skype: danijel.schiavuzzi
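For anyone who wants to watch that attempt counter directly, the currattempts znode mentioned above can be read with the plain ZooKeeper Java client. A minimal sketch, assuming the Zookeeper address and port from the storm.yaml earlier in the thread and the default transactional.zookeeper.root of /transactional; the class name is illustrative and the topology name is passed as an argument:

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class CurrAttemptsDump {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            // Connect to the same Zookeeper ensemble Storm uses for its
            // transactional state (storm.zookeeper.servers/port above).
            ZooKeeper zk = new ZooKeeper("10.61.244.86:2000", 15000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();

            // Topology name is passed as the first argument.
            String path = "/transactional/" + args[0] + "/coordinator/currattempts";
            // Prints something like {"29698959":6487}; an attempt count that
            // only ever grows for the same txid means the batch is never acked.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data, "UTF-8"));
            zk.close();
        }
    }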
