Thanks, Jason. However, I don't think that was the case in my stuck topology; otherwise I'd have seen exceptions (thrown by my Trident functions) in the worker logs.
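To illustrate the scenario Jason describes below -- an each() function that throws whenever a tuple doesn't match the expected schema, and therefore fails its batch deterministically -- here is a minimal sketch (the class name, field layout and parsing are purely illustrative, not my actual code):

import backtype.storm.tuple.Values;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;

// Hypothetical parser: if a message read off the queue doesn't match the
// expected schema, the function throws, the batch fails, and Trident
// retries it -- forever, if the same malformed message keeps being replayed.
public class ParseMessage extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        String raw = tuple.getString(0);
        String[] parts = raw.split(",");
        if (parts.length != 3) {
            // Deterministic failure: the same message throws on every retry.
            throw new RuntimeException("Unexpected message format: " + raw);
        }
        collector.emit(new Values(parts[0], parts[1], parts[2]));
    }
}

A function like this fails the same way on every retry, so the batch can never complete -- but each attempt would also leave a stack trace in the worker logs, which is exactly what I'm not seeing here.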
On Wed, Apr 9, 2014 at 3:02 AM, Jason Jackson <[email protected]> wrote:

> An example of "corrupted input" that causes a batch to fail: say you expect
> the data you read off Kafka (or some other queue) to conform to a schema,
> for whatever reason some of it doesn't, and the function you implement and
> pass to stream.each throws an exception when it hits that unexpected data.
> This causes the batch to be retried, but since it fails deterministically,
> the batch will be retried forever.
>
>
> On Mon, Apr 7, 2014 at 10:37 AM, Danijel Schiavuzzi <[email protected]> wrote:
>
>> Hi Jason,
>>
>> Could you be more specific -- what do you mean by "corrupted input"?
>> Do you mean that there's a bug in Trident itself that causes the tuples in
>> a batch to somehow become corrupted?
>>
>> Thanks a lot!
>>
>> Danijel
>>
>>
>> On Monday, April 7, 2014, Jason Jackson <[email protected]> wrote:
>>
>>> This could happen if you have corrupted input that always causes a batch
>>> to fail and be retried.
>>>
>>> I have seen this behaviour before without any corrupted input, though. It
>>> might be a bug in Trident, I'm not sure. If you figure it out, please
>>> update this thread and/or submit a patch.
>>>
>>>
>>> On Mon, Mar 31, 2014 at 7:39 AM, Danijel Schiavuzzi <[email protected]> wrote:
>>>
>>> To (partially) answer my own question -- I still have no idea about the
>>> cause of the stuck topology, but re-submitting it helps -- after
>>> re-submitting, my topology is now running normally.
>>>
>>>
>>> On Wed, Mar 26, 2014 at 6:04 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>
>>> Also, I did have multiple cases of my IBackingMap workers dying (because
>>> of RuntimeExceptions) but successfully restarting afterwards (I throw
>>> RuntimeExceptions in the BackingMap implementation as my strategy in rare
>>> SQL database deadlock situations, to force a worker restart and to
>>> fail+retry the batch).
>>>
>>> From the logs, one such IBackingMap worker death (and subsequent
>>> restart) resulted in the Kafka spout re-emitting the pending batch:
>>>
>>> 2014-03-22 16:26:43 s.k.t.TridentKafkaEmitter [INFO] re-emitting
>>> batch, attempt 29698959:736
>>>
>>> This is of course the normal behavior of a transactional topology, but
>>> this is the first time I've encountered a batch retrying indefinitely.
>>> It's especially suspicious since the topology had been running fine for
>>> 20 days straight, re-emitting batches and restarting IBackingMap workers
>>> quite a number of times.
>>>
>>> I can see in my IBackingMap's backing SQL database that the batch with
>>> the exact txid value 29698959 has been committed -- but I suspect that
>>> could come from the other BackingMap, since there are two BackingMap
>>> instances running (parallelismHint 2).
>>>
>>> However, I have no idea why the batch is being retried indefinitely now,
>>> nor why it hasn't been successfully acked by Trident.
>>>
>>> Any suggestions on the area (topology component) to focus my research on?
>>>
>>> Thanks,
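To make the deadlock strategy mentioned above concrete, it boils down to the following pattern (a simplified, hypothetical sketch -- class, table and column names are made up, and the txid bookkeeping the real transactional state does is omitted):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

import storm.trident.state.map.IBackingMap;

// Hypothetical SQL-backed map. The real transactional implementation also
// stores the txid next to each value; that part is left out here.
public class SqlBackingMap implements IBackingMap<Long> {

    private final Connection connection;

    public SqlBackingMap(Connection connection) {
        this.connection = connection;
    }

    @Override
    public List<Long> multiGet(List<List<Object>> keys) {
        // Reads omitted in this sketch; a null entry means "no value stored yet".
        return Collections.nCopies(keys.size(), (Long) null);
    }

    @Override
    public void multiPut(List<List<Object>> keys, List<Long> vals) {
        try (PreparedStatement ps = connection.prepareStatement(
                "UPDATE counts SET value = ? WHERE key = ?")) {
            for (int i = 0; i < keys.size(); i++) {
                ps.setLong(1, vals.get(i));
                ps.setString(2, keys.get(i).get(0).toString());
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (SQLException e) {
            // On a rare deadlock (or any other SQL failure), give up: the
            // RuntimeException kills the worker, the worker restarts, and the
            // transactional spout re-emits the whole batch.
            throw new RuntimeException("SQL write failed, failing the batch", e);
        }
    }
}

The only point of the pattern is that a RuntimeException thrown from multiPut() kills the worker; after the worker restarts, the transactional spout re-emits the batch, which is what the "re-emitting batch" log line above shows.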
>>> On Wed, Mar 26, 2014 at 5:32 PM, Danijel Schiavuzzi <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I'm having problems with my transactional Trident topology. It had been
>>> running fine for about 20 days, and suddenly it is stuck processing a
>>> single batch, with no tuples being emitted nor persisted by the
>>> TridentState (IBackingMap).
>>>
>>> It's a simple topology which consumes messages off a Kafka queue. The
>>> spout is an instance of the storm-kafka-0.8-plus
>>> TransactionalTridentKafkaSpout, and I use the trident-mssql transactional
>>> TridentState implementation to persistentAggregate() data into a SQL
>>> database.
>>>
>>> In Zookeeper I can see Storm is retrying a batch, i.e.
>>>
>>> "/transactional/<myTopologyName>/coordinator/currattempts" is
>>> "{"29698959":6487}"
>>>
>>> ... and the attempt count keeps increasing. The batch with txid 29698959
>>> seems stuck -- it apparently never gets acked by Trident, and I have no
>>> idea why, especially since the topology had been running successfully for
>>> the previous 20 days.
>>>
>>> I did rebalance the topology on one occasion, after which it continued
>>> running normally. Other than that, no other modifications were done.
>>> Storm is at version 0.9.0.1.
>>>
>>> Any hints on how to debug the stuck topology? Any other useful info I
>>> might provide?
>>>
>>> Thanks,
>>>
>>> --
>>> Danijel Schiavuzzi
>>>
>>> E: [email protected]
>>> W: www.schiavuzzi.com
>>> T: +385989035562
>>> Skype: danijel.schiavuzzi
>>
>> --
>> Danijel Schiavuzzi
>>
>> E: [email protected]
>> W: www.schiavuzzi.com
>> T: +385989035562
>> Skype: danijels7

--
Danijel Schiavuzzi

E: [email protected]
W: www.schiavuzzi.com
T: +385989035562
Skype: danijels7
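P.S. In case it helps anyone hitting the same problem, the topology wiring is essentially the following (a simplified sketch -- the topic, Zookeeper address, field names and the Count aggregator are illustrative, ParseMessage is the hypothetical function sketched earlier in this mail, and sqlStateFactory stands in for the trident-mssql transactional StateFactory):

import backtype.storm.generated.StormTopology;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.TransactionalTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.state.StateFactory;

public class KafkaAggregationTopology {

    // sqlStateFactory is the trident-mssql transactional StateFactory
    // (its construction is omitted here).
    public static StormTopology build(StateFactory sqlStateFactory) {
        TridentKafkaConfig spoutConfig =
                new TridentKafkaConfig(new ZkHosts("zookeeper:2181"), "events");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TransactionalTridentKafkaSpout spout =
                new TransactionalTridentKafkaSpout(spoutConfig);

        TridentTopology topology = new TridentTopology();
        topology.newStream("kafka-spout", spout)
                // StringScheme emits a single field named "str"
                .each(new Fields("str"), new ParseMessage(), new Fields("key", "a", "b"))
                .groupBy(new Fields("key"))
                .persistentAggregate(sqlStateFactory, new Count(), new Fields("count"))
                .parallelismHint(2); // the two IBackingMap instances mentioned above

        return topology.build();
    }
}

The per-batch attempt counts live in Storm's Zookeeper under "/transactional/<myTopologyName>/coordinator/currattempts" (the node quoted above) and can be read with the standard Zookeeper CLI (zkCli.sh, then a get on that path).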
