On Mon, May 1, 2017 at 10:46 AM, wrote:
> Yes, I understand there's no explicit underlying time-ordering within the
> stream. What I am getting at is that the notion of windowing in Beam and
> Dataflow does rely on there being at least an implicit weak ordering
I have a somewhat similar use case when dealing with failed / cancelled / broken
streaming pipelines.
We have an operator that continuously monitors the min-watermark of the
pipeline, and when it detects that the watermark has not advanced for more
than some threshold, we start a new pipeline and
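The stall check described above can be sketched as follows. This is an illustrative stand-in, not the actual operator; the class and method names, timestamps, and the 15-minute threshold are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: detect a stalled pipeline by checking whether the min-watermark
// has failed to advance for longer than a threshold.
public class WatermarkMonitor {
    // Returns true when the watermark last advanced more than `threshold` ago.
    static boolean isStalled(Instant lastAdvance, Instant now, Duration threshold) {
        return Duration.between(lastAdvance, now).compareTo(threshold) > 0;
    }

    public static void main(String[] args) {
        Instant lastAdvance = Instant.parse("2017-05-01T10:00:00Z");
        Instant now = Instant.parse("2017-05-01T10:20:00Z");
        // 20 minutes without progress exceeds a 15-minute threshold.
        System.out.println(isStalled(lastAdvance, now, Duration.ofMinutes(15)));
    }
}
```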
You should also be able to simply add a Bounded Read from the backup data
source to your pipeline and flatten it with your Pubsub topic. Because all
of the elements produced by both the bounded and unbounded sources will
have consistent timestamps, when you run the pipeline the watermark will be
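The watermark behaviour of a flattened pipeline can be illustrated in miniature: the output watermark is (roughly) the minimum of the per-source watermarks, so the bounded backup read holds it back until it finishes. The class name and epoch-millis values below are hypothetical, purely for illustration.

```java
// Conceptual sketch: flattening a bounded backup source with the Pubsub
// source means the combined watermark tracks the slowest input.
public class FlattenedWatermark {
    static long combinedWatermark(long boundedSourceWatermark, long pubsubWatermark) {
        return Math.min(boundedSourceWatermark, pubsubWatermark);
    }

    public static void main(String[] args) {
        long backup = 1_400_000_000_000L;   // old archive timestamps
        long pubsub = 1_493_640_000_000L;   // near real-time
        // The archive holds the combined watermark back until it drains.
        System.out.println(combinedWatermark(backup, pubsub));
    }
}
```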
Yes, precisely.
I think that could work, yes. What you are suggesting sounds like idea 2)
in my original question.
My main concern is that I would have to allow a great deal of lateness and
that old windows would consume too much memory. Whether it works in my case
or not I don't know yet as I
Within the Beam model, there is no guarantee about the ordering of any
PCollection, nor the ordering of any Iterable produced by a GroupByKey, by
element timestamps or any other comparator. Runners aren't required to
maintain any ordering provided by a source, and do not require sources to
provide
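A practical consequence of the no-ordering guarantee above: a DoFn that needs its grouped values in timestamp order must sort them itself. A minimal sketch, with the grouped values reduced to bare epoch-millis timestamps (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch: a GroupByKey Iterable carries no ordering guarantee, so sort
// the grouped values by timestamp before processing them in order.
public class SortGrouped {
    static List<Long> sortByTimestamp(List<Long> groupedTimestamps) {
        List<Long> sorted = new ArrayList<>(groupedTimestamps);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortByTimestamp(Arrays.asList(30L, 10L, 20L)));
        // [10, 20, 30]
    }
}
```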
I believe that if your data from the past can't affect the data of the
future, because the windows/state are independent of each other, then just
reprocessing the old data using a batch job is simplest and likely to be
the fastest.
About your choices 1, 2, and 3, allowed lateness is relative to the
I have been trying to figure out the potential efficiency of sliding windows.
Looking at the TrafficRoutes example -
https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java
- it seems that the
OK, so the messages are "re-published" on the topic, with the same timestamp as
the original, and consumed again by the pipeline.
Maybe you can play with the allowed lateness and late firings?
Something like:
Window.into(FixedWindows.of(Duration.standardMinutes(xx)))
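Spelled out a little further, the suggestion might look like the fragment below, assuming the Beam Java SDK and Joda-Time. The durations xx / yy / zz are placeholders, and `input` is a hypothetical upstream PCollection; this is a sketch, not a drop-in configuration.

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Sketch: fixed windows with generous allowed lateness and a late-firing
// trigger, so late archive messages still produce (late) panes.
input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(xx)))
    .withAllowedLateness(Duration.standardHours(yy))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(zz))))
    .accumulatingFiredPanes());
```

Note that once `.triggering(...)` is set, Beam requires an explicit allowed lateness and accumulation mode, which is why both appear here.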
Hi Jean-Baptiste,
I think the key point in my case is that I have to process or reprocess
"old" messages. That is, messages that are late because they are streamed
from an archive file and are older than the allowed lateness in the
pipeline.
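The droppable-lateness condition behind this concern can be written down concretely: an element is dropped when its timestamp is older than the watermark minus the allowed lateness. The class name, timestamps, and one-hour lateness below are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: why replayed archive messages are a problem — anything older
// than (watermark - allowed lateness) is droppably late and discarded.
public class LatenessCheck {
    static boolean isDroppablyLate(Instant elementTs, Instant watermark, Duration allowedLateness) {
        return elementTs.isBefore(watermark.minus(allowedLateness));
    }

    public static void main(String[] args) {
        Instant watermark = Instant.parse("2017-05-01T12:00:00Z");
        Duration allowed = Duration.ofHours(1);
        // A message replayed from a day-old archive is far beyond the horizon.
        System.out.println(isDroppablyLate(Instant.parse("2017-04-30T12:00:00Z"), watermark, allowed));
    }
}
```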
In the case I described the messages had already been
Hi Lars,
Interesting use case indeed ;)
Just to understand: if possible, you don't want to re-consume the messages from
the PubSub topic, right? So, you want to "hold" the PCollections for late data
processing?
Regards
JB
On 05/01/2017 04:15 PM, Lars BK wrote:
Hi,
Is there a preferred way of approaching reprocessing historic data with
streaming jobs?
I want to pose this as a general question, but I'm working with Pubsub and
Dataflow specifically. I am a fan of the idea of replaying/fast forwarding
through historic data to reproduce results (as you