In which version is it available?

On 16 Sep 2014 19:01, "Danijel Schiavuzzi" <dani...@schiavuzzi.com> wrote:
Yes, it's been fixed in 'master' for some time now.

Danijel

On Tuesday, September 16, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

Is the issue resolved in any version of Storm?

Regards
Tarkeshwar

On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I've filed a bug report for this under https://issues.apache.org/jira/browse/STORM-406

The issue is 100% reproducible with, it seems, any Trident topology and across multiple Storm versions with the Netty transport enabled. 0MQ is working fine. You can try with TridentWordCount from storm-starter, for example.

Your insight seems correct: when the killed worker re-spawns on the same slot (port), the topology stops processing. See the above JIRA for additional info.

Danijel

On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Thanks Danijel for helping me.

On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I see no issues with your cluster configuration.

You should definitely share the (simplified, if possible) topology code and the steps to reproduce the blockage; better yet, file a JIRA task on Apache's JIRA web -- be sure to include your Trident internals modifications.

Unfortunately, it seems I'm having the same issues now with Storm 0.9.2 too, so I might get back here with some updates soon. It's not as quickly and easily reproducible as it was under 0.9.1, but the bug nonetheless seems to still be present. I'll reduce the number of Storm slots and topology workers as per your insights; hopefully this will make it easier to reproduce the bug with a simplified Trident topology.

On Tuesday, July 15, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We have made a few changes to the Trident core framework code to suit our needs, and these work fine with ZeroMQ. I am sharing the configuration we are using. Can you please check whether our config is fine or not?

The code is quite large, so we are writing a sample topology to reproduce the issue, which we will share with you.

What are the steps to reproduce the issue:
-------------------------------------------------------------

1. We deployed our topology on one Linux machine, with two workers and one acker, and batch size 2.
2. Both workers come up and start processing.
3. After a few seconds, we killed one of the workers with kill -9.
4. When the killed worker respawns on the same port, it hangs.
5. Only retries keep going on.
6. When the killed worker respawns on another port, everything works fine.
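For reference, the scenario above can be exercised with a topology no bigger than storm-starter's TridentWordCount, as mentioned earlier in this thread. Below is a minimal sketch against the pre-rename 0.9.x API, using a FixedBatchSpout as a stand-in spout and the worker/acker counts from step 1; the class name and topology name are arbitrary placeholders:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class NettyReproTopology {

        // Simple sentence splitter, as in storm-starter's TridentWordCount.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Small, cycling batches so the topology is always processing something.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 2,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"));

            Config conf = new Config();
            conf.setNumWorkers(2);   // two workers, as in step 1
            conf.setNumAckers(1);    // one acker, as in step 1
            conf.setDebug(true);

            StormSubmitter.submitTopology("netty-repro", conf, topology.build());
        }
    }

Once both workers are up, kill -9 one worker JVM and check which slot the supervisor restarts it on; per the reports in this thread, a restart onto the same port is what leaves the topology re-emitting the same batch.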
Machine configuration:
--------------------------

[root@sb6270x1637-2 conf]# uname -a
Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

storm.yaml which we are using to launch nimbus, supervisor and ui:

########## These MUST be filled in for a storm configuration
storm.zookeeper.servers:
    - "10.61.244.86"
storm.zookeeper.port: 2000
supervisor.slots.ports:
    - 6788
    - 6789
    - 6800
    - 6801
    - 6802
    - 6803

nimbus.host: "10.61.244.86"

storm.messaging.transport: "backtype.storm.messaging.netty.Context"

storm.messaging.netty.server_worker_threads: 10
storm.messaging.netty.client_worker_threads: 10
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100

storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
topology.acker.executors: 1
topology.message.timeout.secs: 30
supervisor.scheduler.meta:
    name: "supervisor1"

worker.childopts: "-Xmx2048m"

mm.hdfs.ipaddress: "10.61.244.7"
mm.hdfs.port: 9000
topology.batch.size: 2
topology.batch.timeout: 10000
topology.workers: 2
topology.debug: true

Regards
Tarkeshwar

On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Tarkeshwar,

Could you provide a code sample of your topology? Do you have any special configs enabled?

Thanks,

Danijel

On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We are able to reproduce this issue with 0.9.2 as well. We have a two-worker setup running the Trident topology.

When we kill one of the workers and the killed worker respawns on the same port (same slot), that worker is not able to communicate with the second worker; only the transaction attempts keep increasing.

But if the killed worker spawns on a new slot (new communication port), it works fine. Same behavior as in Storm 0.9.0.1.

Please update me if you get any new development.

Regards
Tarkeshwar

On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Bobby,

Just an update on the stuck Trident transactional topology issue -- I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't reproduce the bug anymore. Will keep you posted if any issues arise.

Regards,

Danijel

On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

I have not seen this before. If you could file a JIRA on this, that would be great.

- Bobby
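As an aside, the per-topology keys in the storm.yaml shared earlier in this thread (topology.workers, topology.acker.executors, topology.message.timeout.secs, topology.debug) can also be set from code at submission time instead of cluster-wide. A sketch against the 0.9.x Config API; the mm.* and topology.batch.* keys are custom to the modified Trident build in this thread, so they are only passed through as plain map entries:

    import backtype.storm.Config;

    public class TopologyConfSketch {
        // Builds the per-topology part of the configuration shown above.
        public static Config buildConf() {
            Config conf = new Config();
            conf.setNumWorkers(2);            // topology.workers: 2
            conf.setNumAckers(1);             // topology.acker.executors: 1
            conf.setMessageTimeoutSecs(30);   // topology.message.timeout.secs: 30
            conf.setDebug(true);              // topology.debug: true
            // Custom keys used by the poster's modified Trident code.
            conf.put("topology.batch.size", 2);
            conf.put("topology.batch.timeout", 10000);
            return conf;
        }
    }

Daemon-level settings such as supervisor.slots.ports, nimbus.host and storm.zookeeper.* still belong in storm.yaml.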
From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
Date: Wednesday, June 4, 2014 at 10:30 AM
To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>, "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
Subject: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Hi all,

I've managed to reproduce the stuck topology problem, and it seems it's due to the Netty transport. I'm running with the ZMQ transport enabled now and haven't been able to reproduce it.

The problem is basically a Trident/Kafka transactional topology getting stuck, i.e. re-emitting the same batches over and over again. This happens after the Storm workers restart a few times due to the Kafka spout throwing RuntimeExceptions (because the Kafka consumer in the spout times out with a SocketTimeoutException due to some temporary network problems). Sometimes the topology gets stuck after just one worker restart, and sometimes a few worker restarts are needed to trigger the problem.

I simulated the Kafka spout socket timeouts by blocking network access from Storm to my Kafka machines (with an iptables firewall rule). Most of the time the spouts (workers) would restart normally (after re-enabling access to Kafka) and the topology would continue to process batches, but sometimes the topology would get stuck re-emitting batches after the crashed workers restarted. Killing and re-submitting the topology manually always fixes this, and processing continues normally.

I haven't been able to reproduce this scenario after reverting my Storm cluster's transport to ZeroMQ. With the Netty transport, I can almost always reproduce the problem by causing a worker to restart a number of times (only about 4-5 worker restarts are enough to trigger this).

Any hints on this? Has anyone had the same problem? It does seem a serious issue, as it affects the reliability and fault tolerance of the Storm cluster.

In the meantime, I'll try to prepare a reproducible test case for this.

Thanks,

Danijel

On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

To (partially) answer my own question -- I still have no idea on the cause of the stuck topology, but re-submitting the topology helps -- after re-submitting, my topology is now running normally.
On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the BackingMap implementation as my strategy in rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).

From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:

2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736

This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a case of a batch retrying indefinitely. This is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.

I can see in my IBackingMap backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap, since there are two BackingMap instances running (parallelismHint 2).

However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.

Any suggestions on the area (topology component) to focus my research on?

Thanks,

On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hello,

I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted nor persisted by the TridentState (IBackingMap).

It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.

In Zookeeper I can see Storm is re-trying a batch, i.e.

"/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"

... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing -- it looks like the batch isn't being acked by Trident, and I have no idea why, especially since the topology had been running successfully for the last 20 days.
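A quick way to watch that coordinator attempt counter outside the Storm UI is to read the znode directly with the plain ZooKeeper Java client. A minimal sketch, assuming the default transactional.zookeeper.root of /transactional; the connect string and topology name below are placeholders:

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;

    public class CurrAttemptsDump {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string -- use your transactional.zookeeper.servers
            // (which default to storm.zookeeper.servers) and port.
            ZooKeeper zk = new ZooKeeper("10.61.244.86:2000", 15000, event -> { });
            try {
                // Path as quoted in the thread:
                // /transactional/<topologyName>/coordinator/currattempts
                String path = "/transactional/myTopologyName/coordinator/currattempts";
                // In a real tool you would wait for the connection event first;
                // the client queues the request until the session is established.
                byte[] data = zk.getData(path, false, null);
                // Prints something like {"29698959":6487} -- txid mapped to attempt count.
                System.out.println(new String(data, StandardCharsets.UTF_8));
            } finally {
                zk.close();
            }
        }
    }

If the printed attempt count keeps climbing for the same txid while no new batches commit, the topology is in the stuck state described here.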
I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.

Any hints on how to debug the stuck topology? Any other useful info I might provide?

Thanks,

--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385 98 9035562
Skype: danijel.schiavuzzi
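For readers landing on this thread, the topology shape described in the last two messages is roughly: TransactionalTridentKafkaSpout -> persistentAggregate() into a transactional TridentState backed by an IBackingMap over SQL. The sketch below shows only the "RuntimeException on SQL deadlock" pattern mentioned above; the class name and JDBC helper names are hypothetical, and the real key/value mapping and SQL are omitted:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.List;

    import storm.trident.state.map.IBackingMap;

    // Sketch of the strategy described above: a SQL error (e.g. being chosen as a
    // deadlock victim) is escalated to a RuntimeException, which kills the worker;
    // Storm restarts it and Trident fails and later retries the batch.
    public class DeadlockAwareBackingMap implements IBackingMap<Long> {

        private final Connection connection; // assume an already-open JDBC connection

        public DeadlockAwareBackingMap(Connection connection) {
            this.connection = connection;
        }

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            try {
                return selectCounts(connection, keys); // hypothetical helper
            } catch (SQLException e) {
                throw new RuntimeException("multiGet failed, forcing worker restart", e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            try {
                upsertCounts(connection, keys, vals); // hypothetical helper
            } catch (SQLException e) {
                throw new RuntimeException("multiPut failed, forcing worker restart", e);
            }
        }

        // Hypothetical JDBC helpers -- not part of any Storm API.
        private List<Long> selectCounts(Connection c, List<List<Object>> keys)
                throws SQLException {
            throw new UnsupportedOperationException("fill in the real SQL here");
        }

        private void upsertCounts(Connection c, List<List<Object>> keys, List<Long> vals)
                throws SQLException {
            throw new UnsupportedOperationException("fill in the real SQL here");
        }
    }

In the real state factory, a map like this gets wrapped by Trident's TransactionalMap (with TransactionalValue-wrapped values) and is fed by the Kafka spout through persistentAggregate(), as described in the messages above.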