This seems like a good topic for user@, so I've moved it there (dev@ to BCC).
You can get a bounded PCollection from KafkaIO via either .withMaxNumRecords(<n>) or .withMaxReadTime(<duration>). Whether that meets your use case depends on the details of what you are computing. Periodic batch jobs are harder to get right: in particular, the time you stop reading and the end of a window (especially a session window) are unlikely to coincide, so you'll need to deal with that. (A minimal sketch of a bounded read follows the quoted message below.)

Kenn

On Mon, Mar 13, 2017 at 6:09 PM, Arpan Jain <[email protected]> wrote:
> Hi,
>
> We run multiple streaming pipelines on Cloud Dataflow that read from
> Kafka and write to BigQuery. We don't mind a few hours' delay and are
> thinking of avoiding the costs associated with streaming data into
> BigQuery. Is there existing support (or a future plan) for such a
> scenario? If not, then I guess I will implement one of the following
> options:
> * A BoundedSource implementation for Kafka so that we can run this in
> batch mode.
> * The streaming job writes to GCS, and then a BQ load job writes to
> BigQuery.
>
> Thanks!
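For concreteness, here is a minimal sketch of such a bounded read. It assumes
a Beam release whose KafkaIO uses the Deserializer-based API; the broker
address, topic name, and limits are placeholders, not recommendations.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class BoundedKafkaRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")   // placeholder broker address
        .withTopic("events")                     // placeholder topic name
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // Either cap alone makes the source bounded; with both set, the
        // read stops at whichever limit is hit first.
        .withMaxNumRecords(1_000_000)
        .withMaxReadTime(Duration.standardHours(1))
        .withoutMetadata());                     // yields KV<String, String>

    // Because the source is bounded, a runner such as Dataflow executes
    // this as a batch job rather than a streaming one.
    p.run().waitUntilFinish();
  }
}

Note that a cap like this stops the read at an arbitrary point in event time,
which is exactly why the window-boundary caveat above matters.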
