Spouts and bolts provide an at-least-once guarantee, so it's
entirely up to you to make your application work correctly under that model.
Storm can't help you beyond replaying the failed tuples.

Trident, on the other hand, does all state updates through the "State"
abstraction and gives you a monotonically increasing batch id whenever
state updates are to be applied. If you store that batch id alongside
whatever state you're updating, you can detect whether you're seeing
something that's already been successfully processed or something brand
new. This is described in that state doc I sent.
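To make the idea concrete, here is a minimal sketch of that pattern. This is not Storm's actual State API; the class and method names are hypothetical, and a real store would be a database rather than an in-memory map. The point is just that storing the batch id next to the value makes replayed updates idempotent:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Storm's real State interface): store the batch
// id (txid) alongside each value, and skip any update whose batch id has
// already been applied, so replays become no-ops.
public class TxIdKeyValueStore {
    // Each key maps to {value, txid of the batch that last wrote it}.
    private final Map<String, long[]> store = new HashMap<>();

    // Apply an additive update only if this batch id is newer than the one
    // recorded for the key; a replayed batch is silently ignored.
    public void applyUpdate(String key, long delta, long txid) {
        long[] entry = store.getOrDefault(key, new long[]{0L, -1L});
        if (txid > entry[1]) {                           // brand-new batch
            store.put(key, new long[]{entry[0] + delta, txid});
        }                                                // else: seen before, skip
    }

    public long get(String key) {
        return store.getOrDefault(key, new long[]{0L, -1L})[0];
    }

    public static void main(String[] args) {
        TxIdKeyValueStore s = new TxIdKeyValueStore();
        s.applyUpdate("clicks", 5, 1);
        s.applyUpdate("clicks", 3, 2);
        s.applyUpdate("clicks", 3, 2);    // replay of batch 2: ignored
        System.out.println(s.get("clicks"));  // prints 8, not 11
    }
}
```

Because Trident assigns batch ids monotonically and commits batches in order, the comparison against the stored txid is all the dedup logic the state needs.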

On Wed, Jan 21, 2015 at 4:15 PM, Shawn Bonnin <[email protected]> wrote:

> Nathan, first, thanks a lot for the quick response. I read through the
> Trident guarantees. It seems like micro-batching will help achieve
> exactly-once guarantees for the bolts that write to external data stores
> in the commit phase of a batch.
>
> However, I have a clarifying question about what you said -
>
> *Suppose for example your spout emits tuples A, B, C, D, E and tuple C
> fails. A spout like KestrelSpout would re-emit only tuple C. KafkaSpout, on
> the other hand, would also re-emit all tuples after the failed tuple. So it
> would re-emit C, D, and E, even if D and E were successfully processed.*
>
> My question is how does the Kafka Spout know how many tuples were sent
> through after C? Does it rely on Zookeeper to get the offsets and just
> replay everything after that offset? If yes, do we have to handle the
> repercussions of state corruption, etc., in our downstream bolts? Our
> downstream bolts will be looking for event-sequence-based patterns, so
> when they see the same event twice, they will need the smarts to tell
> whether that was due to a system failure and replay or an actual business
> occurrence.
>
> It seems like these smarts will need to be built regardless of whether we
> do tuple-at-a-time processing or use Trident.
>
> Am I correct in my assessment?
>
>
> Thanks a lot!
>
> On Wed, Jan 21, 2015 at 11:43 AM, Nathan Marz <[email protected]>
> wrote:
>
>> There's no such thing as a total order in a distributed system, as
>> streams are processed in parallel. The ordering guarantee Storm provides is
>> that tuples sent between tasks are received in the order they were sent.
>>
>> Another part of your question is what kind of ordering guarantees you get
>> during failures. With regular Storm, when a tuple fails, it is up to the
>> spout to determine what to re-emit. Suppose for example your spout emits
>> tuples A, B, C, D, E and tuple C fails. A spout like KestrelSpout would
>> re-emit only tuple C. KafkaSpout, on the other hand, would also re-emit all
>> tuples after the failed tuple. So it would re-emit C, D, and E, even if D
>> and E were successfully processed.
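The replay behavior described above can be simulated in a few lines. This is a toy model, not KafkaSpout's actual code: because Kafka tracks position as a single consumer offset rather than per-message acks, replaying a failed tuple means rewinding to its offset and re-emitting everything from there on:

```java
import java.util.List;

// Toy model of KafkaSpout-style replay (not the real spout): the log is an
// ordered sequence of messages, and a failure at offset N rewinds the
// consumer to N, so every message at offset >= N is emitted again --
// including tuples that had already been processed successfully.
public class OffsetReplayDemo {
    static final List<String> LOG = List.of("A", "B", "C", "D", "E");

    // Given the offset of the failed tuple, return everything re-emitted.
    static List<String> replayFrom(int failedOffset) {
        return LOG.subList(failedOffset, LOG.size());
    }

    public static void main(String[] args) {
        // Tuple C (offset 2) failed: C, D, and E are all emitted again.
        System.out.println(replayFrom(2));  // prints [C, D, E]
    }
}
```

A queue-per-message broker like Kestrel can instead re-deliver just the single failed message, which is why the two spouts behave differently.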
>>
>> Trident provides stronger ordering guarantees, as it provides a total
>> ordering among the commit phases for batches. So if a batch fails to commit
>> it will be retried indefinitely until it succeeds. See
>> http://storm.apache.org/documentation/Trident-state.html and
>> http://storm.apache.org/documentation/Trident-spouts.html for more info
>> on this.
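That total ordering of commit phases can be sketched as a simple loop. This is a simulation of the ordering contract, not Trident's implementation: batch N+1 may not commit until batch N's commit has succeeded, and a failed commit is retried before anything later proceeds:

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of Trident's commit-phase ordering (not its real code):
// commits are applied strictly in batch-id order, and a failed commit is
// retried until it succeeds before any later batch may commit.
public class OrderedCommitDemo {
    static final List<Long> committed = new ArrayList<>();
    static boolean failedOnce = false;

    // Simulated commit that fails the first time it sees batch 2.
    static boolean tryCommit(long txid) {
        if (txid == 2 && !failedOnce) {
            failedOnce = true;
            return false;                 // transient failure
        }
        committed.add(txid);
        return true;
    }

    public static void main(String[] args) {
        for (long txid = 1; txid <= 3; txid++) {
            while (!tryCommit(txid)) {
                // Batch failed to commit: retry indefinitely; batch txid+1
                // is blocked until this one succeeds.
            }
        }
        System.out.println(committed);    // prints [1, 2, 3]
    }
}
```

The processing of batches can still overlap; it is only the commit phase that is serialized, which is what makes the stored-txid check in the state reliable.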
>>
>> On Wed, Jan 21, 2015 at 2:34 PM, Shawn Bonnin <[email protected]>
>> wrote:
>>
>>> Trying to look for patterns in the input stream based on the arrival
>>> sequence. We can use something like Kafka on the input to guarantee
>>> order, but once the tuples enter the topology, how can we make sure
>>> that they are processed in the same order as they arrived on Kafka?
>>>
>>> On Wed, Jan 21, 2015 at 11:30 AM, Naresh Kosgi <[email protected]>
>>> wrote:
>>>
>>>> Also more information about why you need a certain order for processing
>>>> would help in recommending how to approach the problem
>>>>
>>>> On Wed, Jan 21, 2015 at 2:28 PM, Naresh Kosgi <[email protected]>
>>>> wrote:
>>>>
>>>>> Storm as a framework does not guarantee order. You will have to code
>>>>> it yourself if you would like your tuples processed in a certain order.
>>>>>
>>>>> On Wed, Jan 21, 2015 at 2:24 PM, Shawn Bonnin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Resending...
>>>>>>
>>>>>> Our use case requires the tuples be processed in order across
>>>>>> failures.
>>>>>>
>>>>>> So we have SpoutA sending data to bolt B &C and Bolt D is the last
>>>>>> bolt that aggregates data from B & C and writes to a database.
>>>>>>
>>>>>> We want to make sure that when we use tuple-at-a-time processing or
>>>>>> the Trident API, the data always gets processed in the same order as
>>>>>> it was read by our spout. Given that between Bolt B & C there would
>>>>>> be parallelism and intermittent failures, my question is the
>>>>>> following -
>>>>>>
>>>>>> How does Storm guarantee processing order of tuples?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Twitter: @nathanmarz
>> http://nathanmarz.com
>>
>
>


-- 
Twitter: @nathanmarz
http://nathanmarz.com
