I'm unaware of any way to do this built into Storm. In general I've found Storm to have a very bare-bones feature set.
I think your instinct of having a separate "Sort Bolt" that reads from the Spout is the way to go, and it's simple:

When receiving from the Spout: add the ID to a List.
When receiving from Bolt1: add the ID to a HashSet and call emitNext().
emitNext(): if the first item in the List is in the HashSet, emit it, remove it from both, and call emitNext() again. Otherwise, do nothing.

You can then parallelize Bolt1 but not the Sort Bolt.

Alternatively, you can have the spout emit a "previous-id" with each tuple, and the Sort Bolt can be clever about the last one it emitted and what it has waiting.

- Sam

From: Bryan Hernandez <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, November 11, 2014 11:08 AM
To: "[email protected]" <[email protected]>
Subject: Re: Ensure tuples are processed in order while still avoiding bottlenecks

I should clarify that I am aware I could make Bolt2 also subscribe to Spout1, so that it knows the correct order. However, I am wondering if there is a built-in Storm way of handling this requirement in general.

Thanks!

Best,
Bryan

On Tue, Nov 11, 2014 at 5:03 PM, Bryan Hernandez <[email protected]> wrote:

Hi,

I'd like to know if there is a way to do the following in Storm:

The topology: Spout1 -> Bolt1 -> Bolt2

Spout1: emits about 1 tuple per second.
Bolt1: execute() method takes, on average, 5 seconds to process each tuple.
Bolt2: must receive tuples in the same order that they were emitted from Spout1.

As I understand it, without parallelization, Bolt1's input queue should grow by 4 tuples every 5 seconds. This, of course, would overflow eventually. However, if I set the parallelism_hint argument of Bolt1 equal to 5, then it should be fine.

Here's the problem: I cannot guarantee that the processing time in Bolt1 will always be 5 seconds. So it could be that a tuple received by Bolt1 later in time is emitted before tuples that were received earlier. In other words, using parallelism, I could have Bolt2 receiving [t2, t1, t3] for tuples emitted from Spout1 as [t1, t2, t3].

Is there a way to make sure that 1) Bolt2 receives the tuples in order, and 2) Bolt1 doesn't fall behind the emission rate of Spout1?

Thanks!

Best,
Bryan
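
For reference, here is a minimal sketch of the Sort Bolt bookkeeping described in the reply at the top of this thread, written as plain Java with no Storm dependency. The class name ReorderBuffer and the methods onSpout and onProcessed are hypothetical and not part of any Storm API; in an actual bolt, execute() would presumably call one or the other depending on which component the tuple came from, and emit whatever IDs come back. One assumption goes beyond the description: onSpout also tries to release waiting IDs, in case a processed ID reaches the Sort Bolt before the spout's copy does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not a Storm class): the ordering state a Sort Bolt would keep. */
public class ReorderBuffer<T> {
    // IDs in the order the spout emitted them; the head is the next ID allowed out.
    private final Deque<T> expected = new ArrayDeque<>();
    // IDs already processed by Bolt1 but still waiting on an earlier ID.
    private final Set<T> finished = new HashSet<>();

    /** Call when an ID arrives directly from the spout. */
    public List<T> onSpout(T id) {
        expected.addLast(id);
        return emitNext(); // assumption: handles a processed ID that arrived early
    }

    /** Call when a processed ID arrives from Bolt1. */
    public List<T> onProcessed(T id) {
        finished.add(id);
        return emitNext();
    }

    /** Releases IDs from the head of the queue for as long as they are finished. */
    private List<T> emitNext() {
        List<T> ready = new ArrayList<>();
        // Set.remove returns true only if the head ID was present in the set.
        while (!expected.isEmpty() && finished.remove(expected.peekFirst())) {
            ready.add(expected.removeFirst());
        }
        return ready;
    }
}
```

With spout order t1, t2, t3 and completion order t2, t1, t3, the three onProcessed calls return [], [t1, t2], and [t3], so the downstream bolt sees the original order. All state lives in one instance, which is why the Sort Bolt itself cannot be parallelized.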
