Thanks Thomas. That's good to know :) Are there any plans to support ordered sources? It seems odd (to me at least) to have a stream-oriented computational model with support for grouping by time (windowing) and yet not provide hooks to exploit the same time-ordering within the stream.
1. May 2017 12:25 by [email protected]: > Within the Beam model, there is no guarantee about the ordering of any > PCollection, nor the ordering of any Iterable produced by a GroupByKey, by > element timestamps or any other comparator. Runners aren't required to > maintain any ordering provided by a source, and do not require sources to > provide any ordering. As such, if you want to process data in sorted order, > currently the only option is to explicitly sort the data. > On Mon, May 1, 2017 at 9:13 AM, <> [email protected]> > wrote: > >> >> I have been trying to figure out the potential efficiency of >> sliding windows. Looking at the TrafficRoutes example - >> >> https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java>> >> - it seems that the GatherStats class explicitly sorts its data (in >> event-time order) within every window for every key. >> (Collections.sort(infoList)). >> Is this necessary? If the data for each key arrives in event-time order and >> that order is maintained as the data flows through the pipeline, then the >> data within each window should already be sorted. For large sliding windows >> with small lags/sliding offsets re-sorting is going to be very inefficient. >> Or is it the case in Beam/DataFlow that even if the underlying data stream >> is ordered, there are no guarantees to the ordering of the data after a >> window transform or GroupByKey has been applied? >> Thanks, >> Bill.>> > >
