Yes, I understand there's no explicit underlying time-ordering within the stream. What I am getting at is that the notion of windowing in Beam and Dataflow does rely on there being at least an implicit weak ordering within the stream (without it, you could never issue a watermark - essentially the assumption of "useful" watermarks can be treated as the definition of weak ordering). Often that weak ordering is in fact a strong ordering in practice, yet in that case we have no way to exploit it. I am also not convinced that maintaining a distributed total order would be that costly, or even necessary: for example, you could distribute over key and then only require a total order on a per-key basis. Anyway, thanks for the clarification - very helpful.

--Bill
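[Editorial note: a minimal Beam Java sketch of the keyed shape Bill describes, assuming a hypothetical Reading element with a sensor id and an event timestamp; the type, accessor names, and window sizes are illustrative and not from the thread. The point is only that once the stream is keyed and grouped per window, any ordering requirement would have to hold only within each key's grouped values rather than across the whole stream - the current model does not let a runner promise that order, as the replies below explain.]

    import org.apache.beam.sdk.coders.DefaultCoder;
    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    // Hypothetical element type standing in for whatever the source emits.
    @DefaultCoder(SerializableCoder.class)
    class Reading implements java.io.Serializable {
      final String sensorId;
      final Instant eventTime;

      Reading(String sensorId, Instant eventTime) {
        this.sensorId = sensorId;
        this.eventTime = eventTime;
      }

      String getSensorId() { return sensorId; }
      Instant getEventTime() { return eventTime; }
    }

    class PerKeyWindowing {
      // Key by sensor id, then apply sliding windows and group: any ordering
      // requirement now concerns only the values grouped under a single key.
      static PCollection<KV<String, Iterable<Reading>>> groupPerKey(
          PCollection<Reading> readings) {
        return readings
            .apply(WithKeys.<String, Reading>of(r -> r.getSensorId())
                .withKeyType(TypeDescriptors.strings()))
            .apply(Window.<KV<String, Reading>>into(
                SlidingWindows.of(Duration.standardMinutes(10))
                    .every(Duration.standardMinutes(1))))
            .apply(GroupByKey.<String, Reading>create());
      }
    }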
1. May 2017 13:18 by [email protected]:

> On Mon, May 1, 2017 at 9:53 AM, [email protected] wrote:
>>
>> Thanks Thomas. That's good to know :) Are there any plans to support ordered
>> sources?
>
> It's been discussed, but there are no concrete plans.
>
>> It seems odd (to me at least) to have a stream-oriented
>> computational model with support for grouping by time (windowing) and yet
>> not provide hooks to exploit the same time-ordering within the stream.
>
> There is actually no underlying time-ordering within the stream. The
> elements are grouped by buffering elements as they come in and
> tracking a watermark (which is a signal that "all the data up to
> timestamp T has now been collected"; see [1] for more details) to
> release completed groups downstream (depending on the triggering). A
> runner that actually guaranteed time-ordered delivery of elements
> could provide this and group more efficiently, but that would
> likely impose the higher cost of maintaining a distributed total order.
>
> [1] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
>> 1. May 2017 12:25 by [email protected]:
>>
>> Within the Beam model, there is no guarantee about the ordering of any
>> PCollection, nor the ordering of any Iterable produced by a GroupByKey, by
>> element timestamps or any other comparator. Runners aren't required to
>> maintain any ordering provided by a source, and do not require sources to
>> provide any ordering. As such, if you want to process data in sorted order,
>> currently the only option is to explicitly sort the data.
>>
>> On Mon, May 1, 2017 at 9:13 AM, [email protected] wrote:
>>>
>>> I have been trying to figure out the potential efficiency of sliding
>>> windows. Looking at the TrafficRoutes example -
>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java
>>> - it seems that the GatherStats class explicitly sorts its data (in
>>> event-time order) within every window for every key
>>> (Collections.sort(infoList)).
>>>
>>> Is this necessary? If the data for each key arrives in event-time order
>>> and that order is maintained as the data flows through the pipeline, then
>>> the data within each window should already be sorted. For large sliding
>>> windows with small lags/sliding offsets, re-sorting is going to be very
>>> inefficient. Or is it the case in Beam/Dataflow that even if the underlying
>>> data stream is ordered, there are no guarantees about the ordering of the
>>> data after a window transform or GroupByKey has been applied?
>>>
>>> Thanks,
>>>
>>> Bill.
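[Editorial note: since the reply above says explicit sorting is currently the only option, here is a rough sketch of what that looks like in Beam Java, reusing the hypothetical Reading type from the earlier sketch. It is a DoFn applied after the GroupByKey that sorts each key's grouped values by event timestamp, in the same spirit as the Collections.sort(infoList) call in TrafficRoutes' GatherStats; it is not the example's actual code.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;

    // Sorts the values grouped under each key within each window. The Iterable
    // produced by GroupByKey carries no ordering guarantee, even if the source
    // delivered elements in event-time order, so the sort must be explicit.
    class SortValuesByEventTime
        extends DoFn<KV<String, Iterable<Reading>>, KV<String, List<Reading>>> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        List<Reading> values = new ArrayList<>();
        for (Reading r : c.element().getValue()) {
          values.add(r);
        }
        values.sort((a, b) -> a.getEventTime().compareTo(b.getEventTime()));
        c.output(KV.of(c.element().getKey(), values));
      }
    }

    // Usage, downstream of the grouped per-key windows from the earlier sketch:
    //   perKeyWindows.apply(ParDo.of(new SortValuesByEventTime()));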
