Your mental model is not off.

Conceptually, within Beam you should be able to read from a bounded and
an unbounded PCollection as you describe. With Flatten you would create
one PCollection which contains the contents of both. It's important that
the watermark tracking for the bounded source follows the timestamps in
your old data; otherwise that data may be considered "late" and may
interact poorly with the triggers specified on your Window.into.
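
To make that concrete, here is a rough, untested sketch of the Flatten
approach (essentially the HDFS + Kafka plan from your message below).
The paths, broker address, topic, window size, and the extractEventTime()
helper are all placeholders for whatever your records actually look like:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class BackfillPlusLive {

  // Hypothetical helper: pull the event time out of a record. Assumes the
  // first comma-separated field is epoch millis; adjust to your format.
  static Instant extractEventTime(String rec) {
    return new Instant(Long.parseLong(rec.split(",", 2)[0]));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded side: historical records. Re-stamp each element with the
    // event time carried in the record itself, so windowing sees the
    // original timestamps rather than the time the file was read.
    PCollection<String> historical =
        p.apply(TextIO.read().from("hdfs://namenode/historical/*"))
         .apply(WithTimestamps.of((String rec) -> extractEventTime(rec)));

    // Unbounded side: live records from Kafka.
    PCollection<String> live =
        p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")
                .withTopic("live-events")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
         .apply(Values.create());

    // One PCollection containing both; downstream transforms cannot tell
    // which side an element came from.
    PCollection<String> all =
        PCollectionList.of(historical).and(live)
            .apply(Flatten.pCollections());

    all.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))));

    p.run();
  }
}

Depending on your triggers you may also want an allowed lateness on the
window, but the key point is re-stamping the bounded elements with their
original event times before the Flatten.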

Curious as to why you want to reprocess the old data?
I have seen use cases where people want old data to affect their
streaming pipeline. Depending on how much old data there is, it may be
worthwhile to record the post-processed version somewhere and load that
directly, rather than reprocessing the entire old data set.
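
If you do go the precompute route, the one-off batch job might look
roughly like the following. Again untested; it reuses the scaffolding and
the extractEventTime() placeholder from the sketch above, Count.perElement()
is just a stand-in for your real windowed statistics, and the exact
file-writing details (windowed writes, sharding) vary by Beam version:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

Pipeline batch = Pipeline.create(PipelineOptionsFactory.create());

batch.apply(TextIO.read().from("hdfs://namenode/historical/*"))
     .apply(WithTimestamps.of((String rec) -> extractEventTime(rec)))
     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
     .apply(Count.perElement())  // stand-in for the real windowed statistics
     .apply(MapElements
         .into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()))
     .apply(TextIO.write()
         .to("hdfs://namenode/historical-stats")
         .withWindowedWrites()
         .withNumShards(1));

batch.run().waitUntilFinish();

The streaming job then loads those (much smaller) precomputed results
instead of re-crunching the raw history on every run.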

On Wed, Mar 30, 2016 at 7:37 AM, Bill McCarthy <[email protected]>
wrote:

> Thanks Max,
>
> I'm unsure what you mean by the "batch part" and "streaming execution"...
> Are you saying that I have to run my entire pipeline in either batch or
> streaming mode?
>
> Perhaps a brief description of what I'm trying to do would help:
>
> 1. I've got some historical intervalized time-series data
> 2. I've also got some live intervalized time-series data
> 3. I want to process both those flows of data in a uniform fashion:
> calculating windowed statistics
> 4. My thought was that I'd have the historical data stored in some data
> store that's easy for Beam to read (e.g. HDFS)
> 5. I'd put live data on Kafka
> 6. Then load up 2 PCollections, one from HDFS and one from Kafka
> 7. Then perform a Flatten transform to get them in one PCollection
> 8. Then run my windowing over the flattened PCollection
> 9. ...
>
> I like this approach because my transforms don't have to care about
> whether data is live or historical, and I can have one processing
> pipeline for both.
>
> Am I barking up the wrong tree? Is my mental model way off with respect
> to how to combine historical and live data?
>
> Thanks
>
> Bill
>
> On Mar 30, 2016, at 5:59 AM, Maximilian Michels <[email protected]> wrote:
>
> Hi Bill,
>
> The batch part of the Flink Runner supports reading from a finite
> collection, but I'm assuming you're working with the streaming execution
> of the Flink Runner. We haven't implemented support for finite sources in
> the streaming Runner yet, but Flink natively supports reading from them,
> so it looks fairly easy to implement. I've filed a JIRA issue and would
> like to look into this later today.
>
> I've recently added a test case (UnboundedSourceITCase) which throws a
> custom exception when the end of the output has been reached. Admittedly,
> that is not a very nice approach, but it works. All the more reason that
> we should support finite sources in streaming mode.
>
> Best,
> Max
>
> On Wed, Mar 30, 2016 at 2:43 AM, William McCarthy <
> [email protected]> wrote:
>
>> Hi,
>>
>> I want to get access to a bounded PCollection in Beam on the Flink
>> runner. Ideally, I’d pull from HDFS. Is that supported? If not, what would
>> be the best way for me to get something bounded, for testing purposes?
>>
>> Thanks,
>>
>> Bill
>
