Thanks for the reply! We are using these pipelines to read structured log lines from Kafka and store them in BigQuery.
.withMaxNumRecords(<n>) or .withMaxReadTime(<duration>) aren't that useful for us
because they do not remember how much they have read in previous runs. (A rough
sketch of the bounded approach is below the quoted thread, for reference.)

On Mon, Mar 13, 2017 at 9:42 PM, Kenneth Knowles <k...@google.com.invalid> wrote:
> This seems like a good topic for user@ so I've moved it there (dev@ to BCC).
>
> You can get a bounded PCollection from KafkaIO via either of
> .withMaxNumRecords(<n>) or .withMaxReadTime(<duration>).
>
> Whether or not that will meet your use case would depend on more details of
> what you are computing. Periodic batch jobs are harder to get right. In
> particular, the time you stop reading and the end of a window (esp.
> sessions) are not likely to coincide, so you'll need to deal with that.
>
> Kenn
>
> On Mon, Mar 13, 2017 at 6:09 PM, Arpan Jain <jainar...@gmail.com> wrote:
>
> > Hi,
> >
> > We run multiple streaming pipelines using Cloud Dataflow that read from
> > Kafka and write to BigQuery. We don't mind a few hours' delay and are
> > thinking of avoiding the costs associated with streaming data into
> > BigQuery. Is there already support (or a future plan) for such a
> > scenario? If not, then I guess I will implement one of the following options:
> > * A BoundedSource implementation for Kafka so that we can run this in
> >   batch mode.
> > * The streaming job writes to GCS and then a BQ load job writes to
> >   BigQuery.
> >
> > Thanks!
> >
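For what it's worth, here is a rough, untested sketch of the bounded approach Kenn
described, against the Beam Java SDK. The broker, topic, table name, and the
single-column schema are placeholders for illustration, and it does not solve the
problem we care about: each run starts from whatever the consumer's auto.offset.reset
gives it rather than resuming from the previous run's offsets. Run as a bounded
pipeline, BigQueryIO should use load jobs rather than streaming inserts.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.kafka.common.serialization.LongDeserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    public class KafkaToBigQueryBatch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromKafka",
                KafkaIO.<Long, String>read()
                    .withBootstrapServers("broker-1:9092")          // placeholder
                    .withTopic("structured-logs")                   // placeholder
                    .withKeyDeserializer(LongDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class)
                    // Either bound turns the unbounded Kafka source into a bounded
                    // one, so the pipeline runs (and terminates) as a batch job.
                    .withMaxReadTime(Duration.standardHours(1))
                    // .withMaxNumRecords(10_000_000L)
                    .withoutMetadata())
            .apply("ToTableRow", ParDo.of(new DoFn<KV<Long, String>, TableRow>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // Placeholder schema: a single "line" column with the raw log line.
                c.output(new TableRow().set("line", c.element().getValue()));
              }
            }))
            .setCoder(TableRowJsonCoder.of())
            .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("my-project:logs.structured_logs")          // placeholder
                    // Assumes the table already exists, so no schema is needed here.
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run().waitUntilFinish();
      }
    }

withMaxReadTime gives a predictable job duration while withMaxNumRecords gives a
predictable batch size; either way, something outside the pipeline would still have
to track where the previous run stopped.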