Re: Efficiency question

2017-05-01 Thread Robert Bradshaw
On Mon, May 1, 2017 at 10:46 AM, wrote:
> Yes, I understand there's no explicit underlying time-ordering within the stream. What I am getting at is that the notion of windowing in Beam and Dataflow does rely on there being at least an implicit weak ordering

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Ankur Chauhan
I have sort of a similar use case when dealing with failed / cancelled / broken streaming pipelines. We have an operator that continuously monitors the min-watermark of the pipeline, and when it detects that the watermark has not advanced for more than some threshold, we start a new pipeline and

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Thomas Groh
You should also be able to simply add a Bounded Read from the backup data source to your pipeline and flatten it with your Pubsub topic. Because all of the elements produced by both the bounded and unbounded sources will have consistent timestamps, when you run the pipeline the watermark will be
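A minimal sketch of the pattern Thomas describes, assuming the Beam Java SDK; the topic name and backup path are hypothetical, and the backup records would still need their original event timestamps (e.g. via WithTimestamps) for the "consistent timestamps" part to hold:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class BackfillPlusLive {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Unbounded source: live messages (topic name is hypothetical).
        PCollection<String> live = p.apply("ReadLive",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/events"));

        // Bounded source: the archived backup data (path is hypothetical).
        // For event-time correctness the records must carry or be assigned
        // their original timestamps downstream of this read.
        PCollection<String> backup = p.apply("ReadBackup",
            TextIO.read().from("gs://my-bucket/backup/*.json"));

        // One logical stream: windows see historic and live elements alike.
        PCollection<String> all =
            PCollectionList.of(backup).and(live).apply(Flatten.pCollections());

        p.run();
      }
    }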

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Lars BK
Yes, precisely; I think that could work. What you are suggesting sounds like idea 2) in my original question. My main concern is that I would have to allow a great deal of lateness and that old windows would consume too much memory. Whether it works in my case or not I don't know yet, as I

Re: Efficiency question

2017-05-01 Thread Thomas Groh
Within the Beam model, there is no guarantee about the ordering of any PCollection, nor the ordering of any Iterable produced by a GroupByKey, by element timestamps or any other comparator. Runners aren't required to maintain any ordering provided by a source, and do not require sources to provide
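Thomas's point has a practical consequence: if downstream logic needs per-key time order, it has to impose that order itself after the GroupByKey. A minimal sketch in the Beam Java SDK, where the value type (a timestamp/reading pair) is an assumption for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Instant;

    // After GroupByKey the Iterable of values has no guaranteed order,
    // so sort explicitly by the timestamp embedded in each value.
    public class SortValuesFn
        extends DoFn<KV<String, Iterable<KV<Instant, Double>>>,
                     KV<String, List<KV<Instant, Double>>>> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        List<KV<Instant, Double>> values = new ArrayList<>();
        for (KV<Instant, Double> v : c.element().getValue()) {
          values.add(v);
        }
        values.sort((a, b) -> a.getKey().compareTo(b.getKey()));
        c.output(KV.of(c.element().getKey(), values));
      }
    }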

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Lukasz Cwik
I believe that if your data from the past can't affect the data of the future because the windows/state are independent of each other, then just reprocessing the old data using a batch job is simplest and likely to be the fastest. About your choices 1, 2, and 3, allowed lateness is relative to the
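A sketch of the batch route Lukasz recommends, assuming the Beam Java SDK; the archive path and the timestamp-parsing helper are hypothetical. In a bounded pipeline the watermark only advances once all input has been read, so no element is dropped as late and allowed lateness does not come into play:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.WithTimestamps;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    public class BatchReprocess {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(TextIO.read().from("gs://my-bucket/archive/*.json"))
            // Recover each record's original event time (stand-in helper).
            .apply(WithTimestamps.of(line -> parseEventTime(line)))
            // Same fixed windows as the streaming job would use.
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(10))));
        p.run();
      }

      // Illustrative only: assumes the first CSV field is an ISO-8601 timestamp.
      static Instant parseEventTime(String line) {
        return Instant.parse(line.split(",")[0]);
      }
    }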

Efficiency question

2017-05-01 Thread billsmith31415
I have been trying to figure out the potential efficiency of sliding windows. Looking at the TrafficRoutes example (https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java), it seems that the
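For context on the efficiency concern, a sketch of a sliding-window declaration in the Beam Java SDK (the parameters here are illustrative, not the actual TrafficRoutes defaults). Each element is assigned to duration/period overlapping windows, which is the per-element fan-out the question is about:

    import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class SlidingWindowFanOut {
      // 4-minute windows starting every 30 seconds: each element lands in
      // 4 min / 30 s = 8 overlapping windows, so any per-window work after
      // a GroupByKey is multiplied by 8 for every element.
      static PCollection<KV<String, Double>> window(
          PCollection<KV<String, Double>> input) {
        return input.apply(
            Window.<KV<String, Double>>into(
                SlidingWindows.of(Duration.standardMinutes(4))
                    .every(Duration.standardSeconds(30))));
      }
    }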

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Jean-Baptiste Onofré
OK, so the messages are "re-published" on the topic, with the same timestamp as the original, and consumed again by the pipeline. Maybe you can play with the allowed lateness and late firings? Something like: Window.into(FixedWindows.of(Duration.standardMinutes(xx)))
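Filled out, the configuration JB hints at might look like the following (Beam Java SDK); the trigger choice, lateness horizon, and accumulation mode are illustrative, not prescribed:

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class LateFiringWindows {
      static PCollection<String> window(PCollection<String> input) {
        return input.apply(
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(10)))
                // Fire at the watermark, then once per batch of late elements.
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                // Keep window state long enough to absorb replayed messages.
                .withAllowedLateness(Duration.standardDays(7))
                .accumulatingFiredPanes());
      }
    }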

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Lars BK
Hi Jean-Baptiste, I think the key point in my case is that I have to process or reprocess "old" messages. That is, messages that are late because they are streamed from an archive file and are older than the allowed lateness in the pipeline. In the case I described, the messages had already been

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Jean-Baptiste Onofré
Hi Lars, interesting use case indeed ;) Just to understand: if possible, you don't want to re-consume the messages from the PubSub topic, right? So, you want to "hold" the PCollections for late data processing? Regards JB On 05/01/2017 04:15 PM, Lars BK wrote: Hi, Is there a preferred

Reprocessing historic data with streaming jobs

2017-05-01 Thread Lars BK
Hi, Is there a preferred way of approaching reprocessing historic data with streaming jobs? I want to pose this as a general question, but I'm working with Pubsub and Dataflow specifically. I am a fan of the idea of replaying/fast forwarding through historic data to reproduce results (as you