Hi Danijel,

Is the issue resolved in any version of Storm?
Regards,
Tarkeshwar

On Thu, Jul 17, 2014 at 6:57 PM, Danijel Schiavuzzi <[email protected]> wrote:

> I've filed a bug report for this under
> https://issues.apache.org/jira/browse/STORM-406
>
> The issue is 100% reproducible with, it seems, any Trident topology and
> across multiple Storm versions with the Netty transport enabled. 0MQ is
> working fine. You can try with TridentWordCount from storm-starter, for
> example.
>
> Your insight seems correct: when the killed worker re-spawns on the same
> slot (port), the topology stops processing. See the above JIRA for
> additional info.
>
> Danijel
>
> On Thu, Jul 17, 2014 at 7:20 AM, M.Tarkeshwar Rao <[email protected]> wrote:
>
>> Thanks, Danijel, for helping me.
>>
>> On Thu, Jul 17, 2014 at 1:37 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>
>>> I see no issues with your cluster configuration.
>>>
>>> You should definitely share the (simplified, if possible) topology
>>> code and the steps to reproduce the blockage; better yet, file a
>>> JIRA task on Apache's JIRA web -- be sure to include your Trident
>>> internals modifications.
>>>
>>> Unfortunately, it seems I'm having the same issues with Storm 0.9.2
>>> too, so I might get back here with some updates soon. It's not as fast
>>> and easy to reproduce as it was under 0.9.1, but the bug seems
>>> nonetheless still present. I'll reduce the number of Storm slots and
>>> topology workers as per your insights; hopefully this will make it
>>> easier to reproduce the bug with a simplified Trident topology.
>>>
>>> On Tuesday, July 15, 2014, M.Tarkeshwar Rao <[email protected]> wrote:
>>>
>>>> Hi Danijel,
>>>>
>>>> We have made a few changes to the Trident core framework code as per
>>>> our needs, and they work fine with ZeroMQ. I am sharing the
>>>> configuration we are using. Can you please check whether our config
>>>> is fine or not?
>>>>
>>>> The code is large, so we are writing a sample topology to reproduce
>>>> the issue, which we will share with you.
>>>>
>>>> Steps to reproduce the issue:
>>>> -------------------------------------------------------------
>>>>
>>>> 1. We deployed our topology on one Linux machine, with two workers
>>>>    and one acker, and batch size 2.
>>>> 2. Both workers come up and start processing.
>>>> 3. After a few seconds I killed one of the workers with kill -9.
>>>> 4. When the killed worker re-spawns on the same port, it hangs.
>>>> 5. Only retries keep happening.
>>>> 6. When the killed worker re-spawns on another port, everything
>>>>    works fine.
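A minimal sample topology for reproducing this -- essentially storm-starter's TridentWordCount submitted with two workers and one acker, matching the steps above -- might look like the sketch below. The class and topology names are illustrative, not taken from this thread.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;
    import storm.trident.tuple.TridentTuple;

    public class NettyRespawnRepro {
        // Simple split function, as in storm-starter's TridentWordCount.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Cycling in-memory spout so batches keep flowing; max batch size 2
            // to mirror the batch size used in the reproduction steps.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 2,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("spout1", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"));

            // Two workers and one acker, as in the steps above.
            Config conf = new Config();
            conf.setNumWorkers(2);
            conf.setNumAckers(1);
            StormSubmitter.submitTopology("netty-respawn-repro", conf, topology.build());
        }
    }

After submitting it, kill -9 one of the worker JVMs and check whether processing resumes once the supervisor restarts the worker on the same slot.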
>>>>
>>>> Machine conf:
>>>> --------------------------
>>>> [root@sb6270x1637-2 conf]# uname -a
>>>> Linux bl460cx2378 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43
>>>> EST 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> *storm.yaml* which we are using to launch Nimbus, the supervisor and the UI:
>>>>
>>>> ########## These MUST be filled in for a storm configuration
>>>> storm.zookeeper.servers:
>>>>     - "10.61.244.86"
>>>> storm.zookeeper.port: 2000
>>>> supervisor.slots.ports:
>>>>     - 6788
>>>>     - 6789
>>>>     - 6800
>>>>     - 6801
>>>>     - 6802
>>>>     - 6803
>>>>
>>>> nimbus.host: "10.61.244.86"
>>>>
>>>> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>> storm.messaging.netty.server_worker_threads: 10
>>>> storm.messaging.netty.client_worker_threads: 10
>>>> storm.messaging.netty.buffer_size: 5242880
>>>> storm.messaging.netty.max_retries: 100
>>>> storm.messaging.netty.max_wait_ms: 1000
>>>> storm.messaging.netty.min_wait_ms: 100
>>>>
>>>> storm.local.dir: "/root/home_98/home/enavgoy/storm-local"
>>>> storm.scheduler: "com.ericsson.storm.scheduler.TopologyScheduler"
>>>> topology.acker.executors: 1
>>>> topology.message.timeout.secs: 30
>>>> supervisor.scheduler.meta:
>>>>     name: "supervisor1"
>>>>
>>>> worker.childopts: "-Xmx2048m"
>>>>
>>>> mm.hdfs.ipaddress: "10.61.244.7"
>>>> mm.hdfs.port: 9000
>>>> topology.batch.size: 2
>>>> topology.batch.timeout: 10000
>>>> topology.workers: 2
>>>> topology.debug: true
>>>>
>>>> Regards,
>>>> Tarkeshwar
>>>>
>>>> On Mon, Jul 7, 2014 at 1:22 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>
>>>>> Hi Tarkeshwar,
>>>>>
>>>>> Could you provide a code sample of your topology? Do you have any
>>>>> special configs enabled?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Danijel
>>>>>
>>>>> On Mon, Jul 7, 2014 at 9:01 AM, M.Tarkeshwar Rao <[email protected]> wrote:
>>>>>
>>>>>> Hi Danijel,
>>>>>>
>>>>>> We are able to reproduce this issue with 0.9.2 as well.
>>>>>> We have a two-worker setup running the Trident topology.
>>>>>>
>>>>>> When we kill one of the workers and the killed worker then spawns on
>>>>>> the same port (same slot), that worker is not able to communicate
>>>>>> with the second worker.
>>>>>>
>>>>>> Only the transaction attempts keep increasing continuously.
>>>>>>
>>>>>> But if the killed worker spawns on a new slot (new communication
>>>>>> port), it works fine. Same behavior as in Storm 0.9.1.
>>>>>>
>>>>>> Please update me if there are any new developments.
>>>>>>
>>>>>> Regards,
>>>>>> Tarkeshwar
>>>>>>
>>>>>> On Thu, Jul 3, 2014 at 7:06 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Bobby,
>>>>>>>
>>>>>>> Just an update on the stuck Trident transactional topology issue --
>>>>>>> I've upgraded to Storm 0.9.2-incubating (from 0.9.1-incubating) and
>>>>>>> can't reproduce the bug anymore. Will keep you posted if any issues
>>>>>>> arise.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Danijel
>>>>>>>
>>>>>>> On Mon, Jun 16, 2014 at 7:56 PM, Bobby Evans <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have not seen this before; if you could file a JIRA on this,
>>>>>>>> that would be great.
>>>>>>>>
>>>>>>>> - Bobby
>>>>>>>>
>>>>>>>> From: Danijel Schiavuzzi <[email protected]>
>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>> Date: Wednesday, June 4, 2014 at 10:30 AM
>>>>>>>> To: "[email protected]" <[email protected]>,
>>>>>>>> "[email protected]" <[email protected]>
>>>>>>>> Subject: Trident transactional topology stuck re-emitting batches
>>>>>>>> with Netty, but running fine with ZMQ (was Re: Topology is stuck)
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I've managed to reproduce the stuck topology problem, and it seems
>>>>>>>> it's due to the Netty transport. I'm running with the ZMQ transport
>>>>>>>> enabled now and haven't been able to reproduce it.
>>>>>>>>
>>>>>>>> The problem is basically a Trident/Kafka transactional topology
>>>>>>>> getting stuck, i.e. re-emitting the same batches over and over
>>>>>>>> again. This happens after the Storm workers restart a few times due
>>>>>>>> to the Kafka spout throwing RuntimeExceptions (because the Kafka
>>>>>>>> consumer in the spout times out with a SocketTimeoutException due
>>>>>>>> to some temporary network problems). Sometimes the topology is
>>>>>>>> stuck after just one worker restart, and sometimes a few worker
>>>>>>>> restarts are needed to trigger the problem.
>>>>>>>>
>>>>>>>> I simulated the Kafka spout socket timeouts by blocking network
>>>>>>>> access from Storm to my Kafka machines (with an iptables firewall
>>>>>>>> rule). Most of the time the spouts (workers) would restart normally
>>>>>>>> (after re-enabling access to Kafka) and the topology would continue
>>>>>>>> to process batches, but sometimes the topology would get stuck
>>>>>>>> re-emitting batches after the crashed workers restarted. Killing
>>>>>>>> and re-submitting the topology manually always fixes this, and
>>>>>>>> processing continues normally.
>>>>>>>>
>>>>>>>> I haven't been able to reproduce this scenario after reverting my
>>>>>>>> Storm cluster's transport to ZeroMQ. With the Netty transport, I
>>>>>>>> can almost always reproduce the problem by causing a worker to
>>>>>>>> restart a number of times (only about 4-5 worker restarts are
>>>>>>>> enough to trigger this).
>>>>>>>>
>>>>>>>> Any hints on this? Has anyone had the same problem? It does seem a
>>>>>>>> serious issue, as it affects the reliability and fault tolerance of
>>>>>>>> the Storm cluster.
>>>>>>>>
>>>>>>>> In the meantime, I'll try to prepare a reproducible test case for
>>>>>>>> this.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Danijel
>>>>>>>>
>>>>>>>> On Mon, Mar 31, 2014 at 4:39 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> To (partially) answer my own question -- I still have no idea
>>>>>>>>> about the cause of the stuck topology, but re-submitting the
>>>>>>>>> topology helps -- after re-submitting, my topology is now running
>>>>>>>>> normally.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Also, I did have multiple cases of my IBackingMap workers dying
>>>>>>>>>> (because of RuntimeExceptions) but successfully restarting
>>>>>>>>>> afterwards (I throw RuntimeExceptions in the BackingMap
>>>>>>>>>> implementation as my strategy in rare SQL database deadlock
>>>>>>>>>> situations, to force a worker restart and to fail and retry the
>>>>>>>>>> batch).
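That fail-fast strategy could look roughly like the sketch below: a hypothetical IBackingMap whose multiPut() rethrows SQL errors (such as deadlocks) as RuntimeExceptions, so the worker dies and Trident re-emits the batch once it comes back. The SqlValueStore interface and the class names are illustrative stand-ins, not the actual code discussed in this thread.

    import java.sql.SQLException;
    import java.util.List;

    import storm.trident.state.map.IBackingMap;

    public class FailFastSqlBackingMap<T> implements IBackingMap<T> {

        // Illustrative persistence interface; stands in for whatever DAO the
        // real topology uses to read and write state in the SQL database.
        public interface SqlValueStore<T> {
            List<T> read(List<List<Object>> keys) throws SQLException;
            void write(List<List<Object>> keys, List<T> vals) throws SQLException;
        }

        private final SqlValueStore<T> store;

        public FailFastSqlBackingMap(SqlValueStore<T> store) {
            this.store = store;
        }

        @Override
        public List<T> multiGet(List<List<Object>> keys) {
            try {
                return store.read(keys);
            } catch (SQLException e) {
                throw new RuntimeException("multiGet failed, failing the batch", e);
            }
        }

        @Override
        public void multiPut(List<List<Object>> keys, List<T> vals) {
            try {
                store.write(keys, vals);
            } catch (SQLException e) {
                // A deadlock (or any other SQL error) crashes the worker; after
                // the worker restarts, Trident re-emits and retries the batch.
                throw new RuntimeException("multiPut failed, failing the batch", e);
            }
        }
    }

In a real topology this map would typically be wrapped in Trident's TransactionalMap or OpaqueMap by the state factory passed to persistentAggregate().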
>>>>>>>>>>
>>>>>>>>>> From the logs, one such IBackingMap worker death (and subsequent
>>>>>>>>>> restart) resulted in the Kafka spout re-emitting the pending
>>>>>>>>>> tuple:
>>>>>>>>>>
>>>>>>>>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO]
>>>>>>>>>> re-emitting batch, attempt 29698959:736
>>>>>>>>>>
>>>>>>>>>> This is of course the normal behavior of a transactional
>>>>>>>>>> topology, but this is the first time I've encountered a case of
>>>>>>>>>> a batch retrying indefinitely. This is especially suspicious
>>>>>>>>>> since the topology has been running fine for 20 days straight,
>>>>>>>>>> re-emitting batches and restarting IBackingMap workers quite a
>>>>>>>>>> number of times.
>>>>>>>>>>
>>>>>>>>>> I can see in my IBackingMap backing SQL database that the batch
>>>>>>>>>> with the exact txid value 29698959 has been committed -- but I
>>>>>>>>>> suspect that could come from another BackingMap, since there are
>>>>>>>>>> two BackingMap instances running (parallelismHint 2).
>>>>>>>>>>
>>>>>>>>>> However, I have no idea why the batch is being retried
>>>>>>>>>> indefinitely now, nor why it hasn't been successfully acked by
>>>>>>>>>> Trident.
>>>>>>>>>>
>>>>>>>>>> Any suggestions on the area (topology component) to focus my
>>>>>>>>>> research on?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I'm having problems with my transactional Trident topology. It
>>>>>>>>>>> has been running fine for about 20 days, and suddenly it is
>>>>>>>>>>> stuck processing a single batch, with no tuples being emitted
>>>>>>>>>>> nor persisted by the TridentState (IBackingMap).
>>>>>>>>>>>
>>>>>>>>>>> It's a simple topology which consumes messages off a Kafka
>>>>>>>>>>> queue. The spout is an instance of the storm-kafka-0.8-plus
>>>>>>>>>>> TransactionalTridentKafkaSpout, and I use the trident-mssql
>>>>>>>>>>> transactional TridentState implementation to
>>>>>>>>>>> persistentAggregate() data into a SQL database.
>>>>>>>>>>>
>>>>>>>>>>> In Zookeeper I can see that Storm is re-trying a batch, i.e.
>>>>>>>>>>>
>>>>>>>>>>> "/transactional/<myTopologyName>/coordinator/currattempts"
>>>>>>>>>>> is "{"29698959":6487}"
>>>>>>>>>>>
>>>>>>>>>>> ... and the attempt count keeps increasing. It seems the batch
>>>>>>>>>>> with txid 29698959 is stuck, as the attempt count in Zookeeper
>>>>>>>>>>> keeps increasing -- it seems the batch isn't being acked by
>>>>>>>>>>> Trident, and I have no idea why, especially since the topology
>>>>>>>>>>> has been running successfully for the last 20 days.
>>>>>>>>>>>
>>>>>>>>>>> I did rebalance the topology on one occasion, after which it
>>>>>>>>>>> continued running normally. Other than that, no other
>>>>>>>>>>> modifications were done. Storm is at version 0.9.0.1.
>>>>>>>>>>>
>>>>>>>>>>> Any hints on how to debug the stuck topology? Any other useful
>>>>>>>>>>> info I might provide?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Danijel Schiavuzzi
>>>>>>>>>>>
>>>>>>>>>>> E: [email protected]
>>>>>>>>>>> W: www.schiavuzzi.com
>>>>>>>>>>> T: +385989035562
>>>>>>>>>>> Skype: danijel.schiavuzzi
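For anyone who wants to watch that attempt counter directly, the currattempts znode mentioned above can be read with the plain ZooKeeper Java client. A minimal sketch, assuming the Zookeeper address and port from the storm.yaml earlier in the thread and the default transactional.zookeeper.root of /transactional; the class name is illustrative and the topology name is passed as an argument:

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class CurrAttemptsDump {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            // Connect to the same Zookeeper ensemble Storm uses for its
            // transactional state (storm.zookeeper.servers/port above).
            ZooKeeper zk = new ZooKeeper("10.61.244.86:2000", 15000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();

            // Topology name is passed as the first argument.
            String path = "/transactional/" + args[0] + "/coordinator/currattempts";
            // Prints something like {"29698959":6487}; an attempt count that
            // only ever grows for the same txid means the batch is never acked.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data, "UTF-8"));
            zk.close();
        }
    }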
