Yes, I understand there's no explicit underlying time-ordering within the stream. What I am getting at is that the notion of windowing in Beam and Dataflow does rely on there being at least an implicit weak ordering within the stream (without it, you could never issue a watermark - essentially the assumption of "useful" watermarks can be treated as the definition of weak ordering). Often that weak ordering is in fact a strong ordering in practice, yet in that case we have no way to exploit it. I am also not convinced that maintaining a distributed total order would be that costly, or even necessary: for example, you could distribute over key and then only require a total order on a per-key basis. Anyway, thanks for the clarification - very helpful.

--Bill
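[Editorial note: a minimal Beam Java sketch of the keyed shape Bill describes, assuming a hypothetical Reading element with a sensor id and an event timestamp; the type, accessor names, and window sizes are illustrative and not from the thread. The point is only that once the stream is keyed and grouped per window, any ordering requirement would have to hold only within each key's grouped values rather than across the whole stream - the current model does not let a runner promise that order, as the replies below explain.]

    import org.apache.beam.sdk.coders.DefaultCoder;
    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    // Hypothetical element type standing in for whatever the source emits.
    @DefaultCoder(SerializableCoder.class)
    class Reading implements java.io.Serializable {
      final String sensorId;
      final Instant eventTime;

      Reading(String sensorId, Instant eventTime) {
        this.sensorId = sensorId;
        this.eventTime = eventTime;
      }

      String getSensorId() { return sensorId; }
      Instant getEventTime() { return eventTime; }
    }

    class PerKeyWindowing {
      // Key by sensor id, then apply sliding windows and group: any ordering
      // requirement now concerns only the values grouped under a single key.
      static PCollection<KV<String, Iterable<Reading>>> groupPerKey(
          PCollection<Reading> readings) {
        return readings
            .apply(WithKeys.<String, Reading>of(r -> r.getSensorId())
                .withKeyType(TypeDescriptors.strings()))
            .apply(Window.<KV<String, Reading>>into(
                SlidingWindows.of(Duration.standardMinutes(10))
                    .every(Duration.standardMinutes(1))))
            .apply(GroupByKey.<String, Reading>create());
      }
    }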
1. May 2017 13:18 by [email protected]:

> On Mon, May 1, 2017 at 9:53 AM, [email protected] wrote:
>>
>> Thanks Thomas. That's good to know :) Are there any plans to support ordered
>> sources?
>
> It's been discussed, but there are no concrete plans.
>
>> It seems odd (to me at least) to have a stream-oriented
>> computational model with support for grouping by time (windowing) and yet
>> not provide hooks to exploit the same time-ordering within the stream.
>
> There is actually no underlying time-ordering within the stream. The
> elements are grouped by buffering elements as they come in and
> tracking a watermark (which is a signal that "all the data up to
> timestamp T has now been collected"; see [1] for more details) to
> release completed groups downstream (depending on the triggering). A
> runner that actually guaranteed time-ordered delivery of elements
> could provide this and group more efficiently, but that would
> likely impose the higher cost of maintaining a distributed total order.
>
> [1] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
>> 1. May 2017 12:25 by [email protected]:
>>
>> Within the Beam model, there is no guarantee about the ordering of any
>> PCollection, nor the ordering of any Iterable produced by a GroupByKey, by
>> element timestamps or any other comparator. Runners aren't required to
>> maintain any ordering provided by a source, and do not require sources to
>> provide any ordering. As such, if you want to process data in sorted order,
>> currently the only option is to explicitly sort the data.
>>
>> On Mon, May 1, 2017 at 9:13 AM, [email protected] wrote:
>>>
>>> I have been trying to figure out the potential efficiency of sliding
>>> windows. Looking at the TrafficRoutes example -
>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java
>>> - it seems that the GatherStats class explicitly sorts its data (in
>>> event-time order) within every window for every key
>>> (Collections.sort(infoList)).
>>>
>>> Is this necessary? If the data for each key arrives in event-time order
>>> and that order is maintained as the data flows through the pipeline, then
>>> the data within each window should already be sorted. For large sliding
>>> windows with small lags/sliding offsets, re-sorting is going to be very
>>> inefficient. Or is it the case in Beam/Dataflow that even if the underlying
>>> data stream is ordered, there are no guarantees about the ordering of the
>>> data after a window transform or GroupByKey has been applied?
>>>
>>> Thanks,
>>>
>>> Bill.
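[Editorial note: since the reply above says explicit sorting is currently the only option, here is a rough sketch of what that looks like in Beam Java, reusing the hypothetical Reading type from the earlier sketch. It is a DoFn applied after the GroupByKey that sorts each key's grouped values by event timestamp, in the same spirit as the Collections.sort(infoList) call in TrafficRoutes' GatherStats; it is not the example's actual code.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;

    // Sorts the values grouped under each key within each window. The Iterable
    // produced by GroupByKey carries no ordering guarantee, even if the source
    // delivered elements in event-time order, so the sort must be explicit.
    class SortValuesByEventTime
        extends DoFn<KV<String, Iterable<Reading>>, KV<String, List<Reading>>> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        List<Reading> values = new ArrayList<>();
        for (Reading r : c.element().getValue()) {
          values.add(r);
        }
        values.sort((a, b) -> a.getEventTime().compareTo(b.getEventTime()));
        c.output(KV.of(c.element().getKey(), values));
      }
    }

    // Usage, downstream of the grouped per-key windows from the earlier sketch:
    //   perKeyWindows.apply(ParDo.of(new SortValuesByEventTime()));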
