Re: Efficiency question

billsmith31415 Mon, 01 May 2017 09:53:22 -0700

Thanks Thomas. That's good to know :) Are there any plans to support ordered 
sources? It seems odd (to me at least) to have a stream-oriented computational 
model with support for grouping by time (windowing) and yet not provide hooks 
to exploit the same time-ordering within the stream.


1. May 2017 12:25 by [email protected]:


> Within the Beam model, there is no guarantee about the ordering of any 
> PCollection, nor the ordering of any Iterable produced by a GroupByKey, by 
> element timestamps or any other comparator. Runners aren't required to 
> maintain any ordering provided by a source, and do not require sources to 
> provide any ordering. As such, if you want to process data in sorted order, 
> currently the only option is to explicitly sort the data.
> On Mon, May 1, 2017 at 9:13 AM,  <> [email protected]> > wrote:
>
>>           >> I have been trying to figure out the potential efficiency of 
>> sliding windows. Looking at the TrafficRoutes example - >> 
>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java>>
>>   -  it seems that the GatherStats class explicitly sorts its data (in 
>> event-time order) within every window for every key. 
>> (Collections.sort(infoList)). 
>> Is this necessary? If the data for each key arrives in event-time order and 
>> that order is maintained as the data flows through the pipeline, then the 
>> data within each window should already be sorted. For large sliding windows 
>> with small lags/sliding offsets re-sorting is going to be very inefficient. 
>> Or is it the case in Beam/DataFlow that even if the underlying data stream 
>> is ordered, there are no guarantees to the ordering of the data after a 
>> window transform or GroupByKey has been applied? 
>> Thanks,
>> Bill.>>   
>
>

Re: Efficiency question

Reply via email to