Re: Efficiency question

Robert Bradshaw Mon, 01 May 2017 10:18:53 -0700

On Mon, May 1, 2017 at 9:53 AM,  <[email protected]> wrote:
>
> Thanks Thomas. That's good to know :) Are there any plans to support ordered
> sources?


It's been discussed, but there are no concrete plans.

> It seems odd (to me at least) to have a stream-oriented
> computational model with support for grouping by time (windowing) and yet
> not provide hooks to exploit the same time-ordering within the stream.

There is actually no underlying time-ordering within the stream. The
elements are grouped by buffering elements as they come in and
tracking a watermark (which is a signal that "all the data up to
timestamp T has now been collected, see [1] for more details) to
release completed groups downstream (depending on the triggering). A
runner that actually guaranteed delivery time-ordered delivery of
elements could provide this and group more efficiently, but that would
likely impose a higher cost of maintaining a distributed total order.

[1] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

> 1. May 2017 12:25 by [email protected]:
>
>
> Within the Beam model, there is no guarantee about the ordering of any
> PCollection, nor the ordering of any Iterable produced by a GroupByKey, by
> element timestamps or any other comparator. Runners aren't required to
> maintain any ordering provided by a source, and do not require sources to
> provide any ordering. As such, if you want to process data in sorted order,
> currently the only option is to explicitly sort the data.
>
> On Mon, May 1, 2017 at 9:13 AM, <[email protected]> wrote:
>>
>> I have been trying to figure out the potential efficiency of sliding
>> windows. Looking at the TrafficRoutes example -
>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java
>> -  it seems that the GatherStats class explicitly sorts its data (in
>> event-time order) within every window for every key.
>> (Collections.sort(infoList)).
>>
>> Is this necessary? If the data for each key arrives in event-time order
>> and that order is maintained as the data flows through the pipeline, then
>> the data within each window should already be sorted. For large sliding
>> windows with small lags/sliding offsets re-sorting is going to be very
>> inefficient. Or is it the case in Beam/DataFlow that even if the underlying
>> data stream is ordered, there are no guarantees to the ordering of the data
>> after a window transform or GroupByKey has been applied?
>>
>> Thanks,
>>
>> Bill.
>
>

Re: Efficiency question

Reply via email to