Within the Beam model, there is no guarantee about the ordering of elements in a PCollection, or of the Iterable produced by a GroupByKey, whether by element timestamp or by any other comparator. Runners aren't required to preserve any ordering provided by a source, and sources aren't required to provide one. So if you want to process data in sorted order, currently the only option is to sort it explicitly.
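To make "explicitly sort" concrete, here is a minimal sketch in plain Java of what that looks like for the values of one key in one window. It deliberately uses no Beam classes: `EventTimeSort`, `Reading`, and `sortedByEventTime` are hypothetical names made up for illustration, and the `Iterable` argument just stands in for whatever a GroupByKey hands you. This is roughly the shape of the Collections.sort call you noticed, not a copy of the example's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Beam or the Dataflow examples.
final class EventTimeSort {

    // Hypothetical value type: one reading plus its event-time timestamp.
    static final class Reading {
        final long timestampMillis;
        final double speed;

        Reading(long timestampMillis, double speed) {
            this.timestampMillis = timestampMillis;
            this.speed = speed;
        }
    }

    // Copies the grouped values (which may arrive in any order) into a list
    // and sorts that list by event time, earliest first.
    static List<Reading> sortedByEventTime(Iterable<Reading> grouped) {
        List<Reading> readings = new ArrayList<>();
        for (Reading r : grouped) {
            readings.add(r);
        }
        readings.sort(Comparator.comparingLong((Reading r) -> r.timestampMillis));
        return readings;
    }
}
```

Note that the sketch buffers all values for the key and window in memory before sorting, so its cost grows with the window size, which is the overhead the question is asking about.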
On Mon, May 1, 2017 at 9:13 AM, <[email protected]> wrote:

> I have been trying to figure out the potential efficiency of sliding
> windows. Looking at the TrafficRoutes example -
> https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java
> - it seems that the GatherStats class explicitly sorts its data (in
> event-time order) within every window for every key.
> (Collections.sort(infoList)).
>
> Is this necessary? If the data for each key arrives in event-time order
> and that order is maintained as the data flows through the pipeline, then
> the data within each window should already be sorted. For large sliding
> windows with small lags/sliding offsets re-sorting is going to be very
> inefficient. Or is it the case in Beam/DataFlow that even if the underlying
> data stream is ordered, there are no guarantees to the ordering of the data
> after a window transform or GroupByKey has been applied?
>
> Thanks,
>
> Bill.
