Re: Ensure tuples are processed in order while still avoiding bottlenecks

Sam Mati Thu, 13 Nov 2014 22:17:21 -0800

I'm unaware of any way to do this built into Storm.  In general I've found 
Storm to have a very bare-bones feature set.


I think your instinct of having a separate "Sort Bolt" that reads from the 
Spout is the way to go, and it's simple:

When receiving from the Spout:  Add the ID to a List.
When receiving from Bolt1:  Add the ID to a HashSet and call "emitNext()".
emitNext():  If the first item in the List is in the HashSet, emit it, remove 
it from both, and call "emitNext()" again.  Otherwise, do nothing.
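
The steps above can be sketched in plain Java, kept free of Storm classes so 
the logic can be read on its own.  This is only a rough sketch under some 
assumptions: the class and method names are hypothetical, IDs are assumed to 
be unique Strings, and the sketch also checks for ready items when the 
Spout's copy arrives, in case the processed tuple got there first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper holding the Sort Bolt's state; not a Storm class. */
class ReorderBuffer {
    // IDs in the order the spout emitted them (the "List" in the description).
    private final Deque<String> expected = new ArrayDeque<>();
    // IDs that have finished processing but have not been released yet.
    private final Set<String> done = new HashSet<>();

    /** Call for each ID arriving directly from the spout. */
    List<String> onSpoutTuple(String id) {
        expected.addLast(id);
        // Assumption: the processed copy may already be waiting.
        return emitNext();
    }

    /** Call for each ID arriving from the processing bolt. */
    List<String> onProcessedTuple(String id) {
        done.add(id);
        return emitNext();
    }

    /** Releases every ID at the head of the queue that is already done. */
    private List<String> emitNext() {
        List<String> ready = new ArrayList<>();
        // Set.remove returns true only if the head ID was present in "done".
        while (!expected.isEmpty() && done.remove(expected.peekFirst())) {
            ready.add(expected.pollFirst());
        }
        return ready;
    }
}
```

A bolt's execute() would call one of the two methods depending on the source 
of the tuple and emit whatever IDs come back, in that order.  The loop 
replaces the recursive call to "emitNext()" so that a long run of ready 
tuples cannot grow the call stack.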

You can then parallelize Bolt1 but not the Sort Bolt.

Alternatively, you can have the spout emit "previous-id" with each tuple and 
the Sort Bolt can be clever about the last one it emitted and what it has 
waiting.
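
One possible reading of this alternative, again as a plain-Java sketch rather 
than a definitive implementation: the Sort Bolt keeps waiting tuples keyed by 
their "previous-id" and releases a tuple once its predecessor has gone out.  
The class name, the method name, and the sentinel used as the first tuple's 
previous ID are all assumptions, and the sketch assumes no tuple is dropped 
or replayed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for the "previous-id" variant; not a Storm class. */
class ChainedReorderBuffer {
    // Waiting tuple IDs, keyed by the ID of the tuple that must precede them.
    private final Map<String, String> waitingByPrevious = new HashMap<>();
    // ID of the most recently released tuple.
    private String lastEmitted;

    /** firstPreviousId is the sentinel the spout attaches to its first tuple. */
    ChainedReorderBuffer(String firstPreviousId) {
        this.lastEmitted = firstPreviousId;
    }

    /** Call for each processed tuple; returns the IDs that may now be emitted. */
    List<String> onProcessedTuple(String previousId, String id) {
        waitingByPrevious.put(previousId, id);
        List<String> ready = new ArrayList<>();
        String next;
        // Follow the chain for as long as the successor of lastEmitted is waiting.
        while ((next = waitingByPrevious.remove(lastEmitted)) != null) {
            ready.add(next);
            lastEmitted = next;
        }
        return ready;
    }
}
```

With this variant the Sort Bolt would no longer need its own subscription to 
the Spout, since the ordering information travels with each tuple.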

- Sam

From: Bryan Hernandez <[email protected]>
Reply-To: [email protected]
Date: Tuesday, November 11, 2014 11:08 AM
To: [email protected]
Subject: Re: Ensure tuples are processed in order while still avoiding 
bottlenecks

I should clarify that I am aware I could make Bolt2 also subscribe to Spout1, 
so that it knows the correct order.  However, I am wondering if there is a 
built-in Storm way of handling this requirement in general.

Thanks!

Best,

Bryan


On Tue, Nov 11, 2014 at 5:03 PM, Bryan Hernandez <[email protected]> wrote:
Hi,

I'd like to know if there is a way to do the following in Storm:

The topology:

Spout1 -> Bolt1 -> Bolt2

Spout1: emits about 1 tuple per second.
Bolt1: execute() method takes, on average, 5 seconds to process each tuple.
Bolt2: must receive tuples in the same order that they were emitted from Spout1.

As I understand it, without parallelization, Bolt1's input queue should grow by 
4 tuples every 5 seconds.  This, of course, would overflow eventually.  
However, if I set the parallelism_hint argument of Bolt1 equal to 5, then it 
should be fine.
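
The arithmetic behind these figures can be written out as a small Java 
sketch.  It assumes a steady arrival rate and a fixed processing time per 
tuple, which is a simplification; the class and method names are 
hypothetical and only serve the illustration.

```java
/** Hypothetical back-of-the-envelope calculator; not a Storm class. */
class Bolt1Capacity {
    /** Net queue growth in tuples per second for a given number of executors. */
    static double queueGrowthPerSecond(double arrivalsPerSecond,
                                       double secondsPerTuple,
                                       int executors) {
        // Each executor finishes one tuple every secondsPerTuple seconds.
        double serviceRate = executors / secondsPerTuple;
        return Math.max(0.0, arrivalsPerSecond - serviceRate);
    }

    /** Smallest executor count whose combined rate keeps up with arrivals. */
    static int minExecutors(double arrivalsPerSecond, double secondsPerTuple) {
        return (int) Math.ceil(arrivalsPerSecond * secondsPerTuple);
    }
}
```

With 1 tuple per second and 5 seconds per tuple, one executor leaves the 
queue growing by 0.8 tuples per second, which is 4 tuples every 5 seconds, 
and the minimum executor count comes out at 5.  Since the processing time is 
only an average, a value of exactly 5 leaves no headroom.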

Here's the problem:

I cannot guarantee that the processing time in Bolt1 will always be 5 seconds.  
So it could be that a tuple received by Bolt1 later in time is emitted before 
tuples that were received earlier than it.  In other words, using parallelism, 
I could have Bolt2 receiving [t2, t1, t3], for tuples emitted from Spout1 as 
[t1, t2, t3].

Is there a way to make sure that 1) Bolt2 receives the tuples in order, and 
2) Bolt1 doesn't fall behind the emission rate of Spout1?

Thanks!

Best,
Bryan
