On Mon, May 1, 2017 at 9:53 AM, <[email protected]> wrote: > > Thanks Thomas. That's good to know :) Are there any plans to support ordered > sources?
It's been discussed, but there are no concrete plans. > It seems odd (to me at least) to have a stream-oriented > computational model with support for grouping by time (windowing) and yet > not provide hooks to exploit the same time-ordering within the stream. There is actually no underlying time-ordering within the stream. The elements are grouped by buffering elements as they come in and tracking a watermark (which is a signal that "all the data up to timestamp T has now been collected, see [1] for more details) to release completed groups downstream (depending on the triggering). A runner that actually guaranteed delivery time-ordered delivery of elements could provide this and group more efficiently, but that would likely impose a higher cost of maintaining a distributed total order. [1] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 > 1. May 2017 12:25 by [email protected]: > > > Within the Beam model, there is no guarantee about the ordering of any > PCollection, nor the ordering of any Iterable produced by a GroupByKey, by > element timestamps or any other comparator. Runners aren't required to > maintain any ordering provided by a source, and do not require sources to > provide any ordering. As such, if you want to process data in sorted order, > currently the only option is to explicitly sort the data. > > On Mon, May 1, 2017 at 9:13 AM, <[email protected]> wrote: >> >> I have been trying to figure out the potential efficiency of sliding >> windows. Looking at the TrafficRoutes example - >> https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java >> - it seems that the GatherStats class explicitly sorts its data (in >> event-time order) within every window for every key. >> (Collections.sort(infoList)). >> >> Is this necessary? If the data for each key arrives in event-time order >> and that order is maintained as the data flows through the pipeline, then >> the data within each window should already be sorted. For large sliding >> windows with small lags/sliding offsets re-sorting is going to be very >> inefficient. Or is it the case in Beam/DataFlow that even if the underlying >> data stream is ordered, there are no guarantees to the ordering of the data >> after a window transform or GroupByKey has been applied? >> >> Thanks, >> >> Bill. > >
