In which version is it available?

On 16 Sep 2014 19:01, "Danijel Schiavuzzi" <dani...@schiavuzzi.com> wrote:
Yes, it's been fixed in 'master' for some time now.

Danijel

On Tuesday, September 16, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

Is the issue resolved in any version of Storm?

Regards
Tarkeshwar

On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I've filed a bug report for this under https://issues.apache.org/jira/browse/STORM-406

The issue is 100% reproducible with, it seems, any Trident topology and across multiple Storm versions with the Netty transport enabled. 0MQ is working fine. You can try with TridentWordCount from storm-starter, for example.

Your insight seems correct: when the killed worker re-spawns on the same slot (port), the topology stops processing. See the above JIRA for additional info.

Danijel

On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Thanks Danijel for helping me.

On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

I see no issues with your cluster configuration.

You should definitely share the (simplified, if possible) topology code and the steps to reproduce the blockage; better yet, file a JIRA task on Apache's JIRA web -- be sure to include your Trident internals modifications.

Unfortunately, it seems I'm having the same issues now with Storm 0.9.2 too, so I might get back here with some updates soon. It's not as quickly and easily reproducible as it was under 0.9.1, but the bug nonetheless seems to still be present. I'll reduce the number of Storm slots and topology workers as per your insights; hopefully this will make it easier to reproduce the bug with a simplified Trident topology.

On Tuesday, July 15, 2014, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We have made a few changes to the Trident core framework code to suit our needs, and these work fine with ZeroMQ. I am sharing the configuration we are using. Can you please check whether our config is fine or not?

The code is quite large, so we are writing a sample topology to reproduce the issue, which we will share with you.

What are the steps to reproduce the issue:
-------------------------------------------------------------

1. We deployed our topology on one Linux machine, with two workers and one acker, and batch size 2.
2. Both workers come up and start processing.
3. After a few seconds, we killed one of the workers with kill -9.
4. When the killed worker respawns on the same port, it hangs.
5. Only retries keep going on.
6. When the killed worker respawns on another port, everything works fine.
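For reference, the scenario above can be exercised with a topology no bigger than storm-starter's TridentWordCount, as mentioned earlier in this thread. Below is a minimal sketch against the pre-rename 0.9.x API, using a FixedBatchSpout as a stand-in spout and the worker/acker counts from step 1; the class name and topology name are arbitrary placeholders:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class NettyReproTopology {

        // Simple sentence splitter, as in storm-starter's TridentWordCount.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Small, cycling batches so the topology is always processing something.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 2,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"));

            Config conf = new Config();
            conf.setNumWorkers(2);   // two workers, as in step 1
            conf.setNumAckers(1);    // one acker, as in step 1
            conf.setDebug(true);

            StormSubmitter.submitTopology("netty-repro", conf, topology.build());
        }
    }

Once both workers are up, kill -9 one worker JVM and check which slot the supervisor restarts it on; per the reports in this thread, a restart onto the same port is what leaves the topology re-emitting the same batch.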
Machine configuration:
--------------------------

[root@sb6270x1637-2 conf]# uname -a
Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

storm.yaml which we are using to launch nimbus, supervisor and ui:

########## These MUST be filled in for a storm configuration
storm.zookeeper.servers:
    - "10.61.244.86"
storm.zookeeper.port: 2000
supervisor.slots.ports:
    - 6788
    - 6789
    - 6800
    - 6801
    - 6802
    - 6803

nimbus.host: "10.61.244.86"

storm.messaging.transport: "backtype.storm.messaging.netty.Context"

storm.messaging.netty.server_worker_threads: 10
storm.messaging.netty.client_worker_threads: 10
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100

storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
topology.acker.executors: 1
topology.message.timeout.secs: 30
supervisor.scheduler.meta:
    name: "supervisor1"

worker.childopts: "-Xmx2048m"

mm.hdfs.ipaddress: "10.61.244.7"
mm.hdfs.port: 9000
topology.batch.size: 2
topology.batch.timeout: 10000
topology.workers: 2
topology.debug: true

Regards
Tarkeshwar

On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Tarkeshwar,

Could you provide a code sample of your topology? Do you have any special configs enabled?

Thanks,

Danijel

On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <tarkeshwa...@gmail.com> wrote:

Hi Danijel,

We are able to reproduce this issue with 0.9.2 as well. We have a two-worker setup running the Trident topology.

When we kill one of the workers and the killed worker respawns on the same port (same slot), that worker is not able to communicate with the second worker; only the transaction attempts keep increasing.

But if the killed worker spawns on a new slot (new communication port), it works fine. Same behavior as in Storm 0.9.0.1.

Please update me if you get any new development.

Regards
Tarkeshwar

On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hi Bobby,

Just an update on the stuck Trident transactional topology issue -- I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and can't reproduce the bug anymore. Will keep you posted if any issues arise.

Regards,

Danijel

On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

I have not seen this before. If you could file a JIRA on this, that would be great.

- Bobby
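As an aside, the per-topology keys in the storm.yaml shared earlier in this thread (topology.workers, topology.acker.executors, topology.message.timeout.secs, topology.debug) can also be set from code at submission time instead of cluster-wide. A sketch against the 0.9.x Config API; the mm.* and topology.batch.* keys are custom to the modified Trident build in this thread, so they are only passed through as plain map entries:

    import backtype.storm.Config;

    public class TopologyConfSketch {
        // Builds the per-topology part of the configuration shown above.
        public static Config buildConf() {
            Config conf = new Config();
            conf.setNumWorkers(2);            // topology.workers: 2
            conf.setNumAckers(1);             // topology.acker.executors: 1
            conf.setMessageTimeoutSecs(30);   // topology.message.timeout.secs: 30
            conf.setDebug(true);              // topology.debug: true
            // Custom keys used by the poster's modified Trident code.
            conf.put("topology.batch.size", 2);
            conf.put("topology.batch.timeout", 10000);
            return conf;
        }
    }

Daemon-level settings such as supervisor.slots.ports, nimbus.host and storm.zookeeper.* still belong in storm.yaml.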
From: Danijel Schiavuzzi <dani...@schiavuzzi.com>
Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
Date: Wednesday, June 4, 2014 at 10:30 AM
To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>, "d...@storm.incubator.apache.org" <d...@storm.incubator.apache.org>
Subject: Trident transactional topology stuck re-emitting batches with Netty, but running fine with ZMQ (was Re: Topology is stuck)

Hi all,

I've managed to reproduce the stuck topology problem, and it seems it's due to the Netty transport. I'm running with the ZMQ transport enabled now and haven't been able to reproduce it.

The problem is basically a Trident/Kafka transactional topology getting stuck, i.e. re-emitting the same batches over and over again. This happens after the Storm workers restart a few times due to the Kafka spout throwing RuntimeExceptions (because the Kafka consumer in the spout times out with a SocketTimeoutException due to some temporary network problems). Sometimes the topology gets stuck after just one worker restart, and sometimes a few worker restarts are needed to trigger the problem.

I simulated the Kafka spout socket timeouts by blocking network access from Storm to my Kafka machines (with an iptables firewall rule). Most of the time the spouts (workers) would restart normally (after re-enabling access to Kafka) and the topology would continue to process batches, but sometimes the topology would get stuck re-emitting batches after the crashed workers restarted. Killing and re-submitting the topology manually always fixes this, and processing continues normally.

I haven't been able to reproduce this scenario after reverting my Storm cluster's transport to ZeroMQ. With the Netty transport, I can almost always reproduce the problem by causing a worker to restart a number of times (only about 4-5 worker restarts are enough to trigger this).

Any hints on this? Has anyone had the same problem? It does seem a serious issue, as it affects the reliability and fault tolerance of the Storm cluster.

In the meantime, I'll try to prepare a reproducible test case for this.

Thanks,

Danijel

On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

To (partially) answer my own question -- I still have no idea on the cause of the stuck topology, but re-submitting the topology helps -- after re-submitting, my topology is now running normally.
On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Also, I did have multiple cases of my IBackingMap workers dying (because of RuntimeExceptions) but successfully restarting afterwards (I throw RuntimeExceptions in the BackingMap implementation as my strategy in rare SQL database deadlock situations, to force a worker restart and to fail and retry the batch).

From the logs, one such IBackingMap worker death (and subsequent restart) resulted in the Kafka spout re-emitting the pending tuple:

2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting batch, attempt 29698959:736

This is of course the normal behavior of a transactional topology, but this is the first time I've encountered a case of a batch retrying indefinitely. This is especially suspicious since the topology had been running fine for 20 days straight, re-emitting batches and restarting IBackingMap workers quite a number of times.

I can see in my IBackingMap backing SQL database that the batch with the exact txid value 29698959 has been committed -- but I suspect that could come from another BackingMap, since there are two BackingMap instances running (parallelismHint 2).

However, I have no idea why the batch is being retried indefinitely now, nor why it hasn't been successfully acked by Trident.

Any suggestions on the area (topology component) to focus my research on?

Thanks,

On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <dani...@schiavuzzi.com> wrote:

Hello,

I'm having problems with my transactional Trident topology. It had been running fine for about 20 days, and suddenly it is stuck processing a single batch, with no tuples being emitted nor persisted by the TridentState (IBackingMap).

It's a simple topology which consumes messages off a Kafka queue. The spout is an instance of the storm-kafka-0.8-plus TransactionalTridentKafkaSpout, and I use the trident-mssql transactional TridentState implementation to persistentAggregate() data into a SQL database.

In Zookeeper I can see Storm is re-trying a batch, i.e.

"/transactional/<myTopologyName>/coordinator/currattempts" is "{"29698959":6487}"

... and the attempt count keeps increasing. It seems the batch with txid 29698959 is stuck, as the attempt count in Zookeeper keeps increasing -- it looks like the batch isn't being acked by Trident, and I have no idea why, especially since the topology had been running successfully for the last 20 days.
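A quick way to watch that coordinator attempt counter outside the Storm UI is to read the znode directly with the plain ZooKeeper Java client. A minimal sketch, assuming the default transactional.zookeeper.root of /transactional; the connect string and topology name below are placeholders:

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;

    public class CurrAttemptsDump {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string -- use your transactional.zookeeper.servers
            // (which default to storm.zookeeper.servers) and port.
            ZooKeeper zk = new ZooKeeper("10.61.244.86:2000", 15000, event -> { });
            try {
                // Path as quoted in the thread:
                // /transactional/<topologyName>/coordinator/currattempts
                String path = "/transactional/myTopologyName/coordinator/currattempts";
                // In a real tool you would wait for the connection event first;
                // the client queues the request until the session is established.
                byte[] data = zk.getData(path, false, null);
                // Prints something like {"29698959":6487} -- txid mapped to attempt count.
                System.out.println(new String(data, StandardCharsets.UTF_8));
            } finally {
                zk.close();
            }
        }
    }

If the printed attempt count keeps climbing for the same txid while no new batches commit, the topology is in the stuck state described here.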
I did rebalance the topology on one occasion, after which it continued running normally. Other than that, no other modifications were done. Storm is at version 0.9.0.1.

Any hints on how to debug the stuck topology? Any other useful info I might provide?

Thanks,

--
Danijel Schiavuzzi

E: dani...@schiavuzzi.com
W: www.schiavuzzi.com
T: +385 98 9035562
Skype: danijel.schiavuzzi
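For readers landing on this thread, the topology shape described in the last two messages is roughly: TransactionalTridentKafkaSpout -> persistentAggregate() into a transactional TridentState backed by an IBackingMap over SQL. The sketch below shows only the "RuntimeException on SQL deadlock" pattern mentioned above; the class name and JDBC helper names are hypothetical, and the real key/value mapping and SQL are omitted:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.List;

    import storm.trident.state.map.IBackingMap;

    // Sketch of the strategy described above: a SQL error (e.g. being chosen as a
    // deadlock victim) is escalated to a RuntimeException, which kills the worker;
    // Storm restarts it and Trident fails and later retries the batch.
    public class DeadlockAwareBackingMap implements IBackingMap<Long> {

        private final Connection connection; // assume an already-open JDBC connection

        public DeadlockAwareBackingMap(Connection connection) {
            this.connection = connection;
        }

        @Override
        public List<Long> multiGet(List<List<Object>> keys) {
            try {
                return selectCounts(connection, keys); // hypothetical helper
            } catch (SQLException e) {
                throw new RuntimeException("multiGet failed, forcing worker restart", e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<Long> vals) {
            try {
                upsertCounts(connection, keys, vals); // hypothetical helper
            } catch (SQLException e) {
                throw new RuntimeException("multiPut failed, forcing worker restart", e);
            }
        }

        // Hypothetical JDBC helpers -- not part of any Storm API.
        private List<Long> selectCounts(Connection c, List<List<Object>> keys)
                throws SQLException {
            throw new UnsupportedOperationException("fill in the real SQL here");
        }

        private void upsertCounts(Connection c, List<List<Object>> keys, List<Long> vals)
                throws SQLException {
            throw new UnsupportedOperationException("fill in the real SQL here");
        }
    }

In the real state factory, a map like this gets wrapped by Trident's TransactionalMap (with TransactionalValue-wrapped values) and is fed by the Kafka spout through persistentAggregate(), as described in the messages above.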