This is a *fantastic* question. How we identify individual elements across multiple DStreams is worth looking at.
The reason is that you can then fine-tune your streaming job based on the RDD identifiers (i.e., do the timestamps from the producer correlate closely with the order in which RDD elements are being produced?). If *NO*, then you need to either (1) dial up throughput on the producer sources or (2) increase cluster size so that Spark is capable of evenly handling the load. You can't decide between (1) and (2) unless you can track when the streaming elements are being converted to RDDs by Spark itself.

On Wed, Feb 18, 2015 at 6:54 PM, Neelesh <neele...@gmail.com> wrote:

> There does not seem to be a definitive answer on this. Every time I google
> for message ordering, the only relevant thing that comes up is this -
> http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
>
> With a Kafka receiver that pulls data from a single Kafka partition of a
> Kafka topic, are individual messages in the microbatch in the same order as
> the Kafka partition? Are successive microbatches originating from a Kafka
> partition executed in order?
>
> Thanks!

--
jay vyas
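
P.S. The check I'm describing above (producer timestamps vs. when the elements land in an RDD) can be sketched without Spark at all. This is just an illustrative, minimal sketch: the record shape `(producer_timestamp, payload)` and the helper names are my own assumptions, not anything Spark or Kafka gives you directly.

```python
# Hypothetical sketch: decide between (1) producer throughput and
# (2) cluster size by correlating producer timestamps with the time
# the micro-batch was turned into an RDD. All names are illustrative.

def order_preserved(records):
    """True if elements appear in producer-timestamp order."""
    producer_ts = [ts for ts, _ in records]
    return producer_ts == sorted(producer_ts)

def mean_lag(records, batch_created_at):
    """Average delay (seconds) between production and RDD creation."""
    return sum(batch_created_at - ts for ts, _ in records) / len(records)

# Example micro-batch: (producer_timestamp, payload) pairs, plus the
# time at which the batch became an RDD.
batch = [(100.0, "a"), (101.0, "b"), (103.0, "c")]
batch_created_at = 105.0

print(order_preserved(batch))            # ordering looks healthy
print(mean_lag(batch, batch_created_at)) # growing lag => (1) or (2) above
```

If ordering holds but the lag keeps growing batch over batch, the cluster is falling behind the producers and (2) is the likelier fix; if ordering itself is broken, look at the producer side first.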