This could happen if you have corrupted input that always causes a batch to fail and be retried.
I have seen this behaviour before, and I didn't see corrupted input. It might be a bug in Trident, I'm not sure. If you figure it out, please update this thread and/or submit a patch.

On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <[email protected]> wrote:

> To (partially) answer my own question -- I still have no idea on the cause
> of the stuck topology, but re-submitting the topology helps -- after
> re-submitting, my topology is now running normally.
>
> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>
>> Also, I did have multiple cases of my IBackingMap workers dying (because
>> of RuntimeExceptions) but successfully restarting afterwards (I throw
>> RuntimeExceptions in the BackingMap implementation as my strategy in rare
>> SQL database deadlock situations, to force a worker restart and to
>> fail+retry the batch).
>>
>> From the logs, one such IBackingMap worker death (and subsequent restart)
>> resulted in the Kafka spout re-emitting the pending tuple:
>>
>>     2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>     batch, attempt 29698959:736
>>
>> This is of course the normal behavior of a transactional topology, but
>> this is the first time I've encountered a case of a batch retrying
>> indefinitely. This is especially suspicious since the topology has been
>> running fine for 20 days straight, re-emitting batches and restarting
>> IBackingMap workers quite a number of times.
>>
>> I can see in my IBackingMap backing SQL database that the batch with the
>> exact txid value 29698959 has been committed -- but I suspect that could
>> come from another BackingMap, since there are two BackingMap instances
>> running (parallelismHint 2).
>>
>> However, I have no idea why the batch is being retried indefinitely now,
>> nor why it hasn't been successfully acked by Trident.
>>
>> Any suggestions on the area (topology component) to focus my research on?
>>
>> Thanks,
>>
>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm having problems with my transactional Trident topology. It has been
>>> running fine for about 20 days, and suddenly it is stuck processing a
>>> single batch, with no tuples being emitted nor persisted by the
>>> TridentState (IBackingMap).
>>>
>>> It's a simple topology which consumes messages off a Kafka queue. The
>>> spout is an instance of storm-kafka-0.8-plus TransactionalTridentKafkaSpout,
>>> and I use the trident-mssql transactional TridentState implementation to
>>> persistentAggregate() data into a SQL database.
>>>
>>> In Zookeeper I can see Storm is re-trying a batch, i.e.
>>>
>>>     "/transactional/<myTopologyName>/coordinator/currattempts" is
>>>     "{"29698959":6487}"
>>>
>>> ... and the attempt count keeps increasing. It seems the batch with txid
>>> 29698959 is stuck: the attempt count in Zookeeper keeps increasing, so
>>> the batch isn't being acked by Trident, and I have no idea why,
>>> especially since the topology has been running successfully for the last
>>> 20 days.
>>>
>>> I did rebalance the topology on one occasion, after which it continued
>>> running normally. Other than that, no other modifications were made.
>>> Storm is at version 0.9.0.1.
>>>
>>> Any hints on how to debug the stuck topology? Any other useful info I
>>> might provide?
>>>
>>> Thanks,
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: [email protected]
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijel.schiavuzzi
>
> --
> Danijel Schiavuzzi
>
> E: [email protected]
> W: www.schiavuzzi.com
> T: +385989035562
> Skype: danijels7
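For readers landing on this thread: the fail-and-retry strategy Danijel describes can be sketched in plain Java. This is a minimal, hypothetical illustration, not his actual code -- the `BackingMap` interface below is a simplified stand-in for Trident's real `storm.trident.state.map.IBackingMap`, and the deadlock is simulated with a flag instead of a caught `SQLException`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Trident's IBackingMap<T>, reduced to the two
// calls relevant here (the real interface lives in storm.trident.state.map).
interface BackingMap<T> {
    List<T> multiGet(List<List<Object>> keys);
    void multiPut(List<List<Object>> keys, List<T> vals);
}

// Hypothetical SQL-backed map illustrating the strategy from the thread:
// when a deadlock occurs during multiPut, rethrow as an unchecked exception
// so the worker dies, Storm restarts it, and the transactional spout
// re-emits the same batch with an incremented attempt count -- producing
// the "re-emitting batch, attempt <txid>:<attempt>" log line quoted above.
class DeadlockRetryMap implements BackingMap<Long> {
    private final Map<List<Object>, Long> store = new HashMap<>();
    boolean simulateDeadlock = false; // stands in for a real SQL deadlock error

    @Override
    public List<Long> multiGet(List<List<Object>> keys) {
        List<Long> out = new ArrayList<>();
        for (List<Object> k : keys) {
            out.add(store.get(k)); // null for unseen keys
        }
        return out;
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<Long> vals) {
        if (simulateDeadlock) {
            // In a real implementation this would wrap a caught SQLException
            // whose error code indicates a deadlock. Rethrowing unchecked
            // fails the batch for a clean retry instead of losing data.
            throw new RuntimeException("SQL deadlock -- failing batch so Trident retries it");
        }
        for (int i = 0; i < keys.size(); i++) {
            store.put(keys.get(i), vals.get(i));
        }
    }
}
```

The caveat raised at the top of the thread applies: if the failure is deterministic (e.g. corrupted input that deadlocks or throws on every attempt), the same batch will be retried forever, which is consistent with the ever-growing attempt count under `/transactional/<myTopologyName>/coordinator/currattempts`.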
