Thanks for the reply! We are using these pipelines to read structured log lines from Kafka and store them in BigQuery.
.withMaxNumRecords(<n>) or .withMaxReadTime(<duration>) aren't that useful for us
because they do not remember how much they have read in previous runs. (A rough
sketch of the bounded approach is below the quoted thread, for reference.)

On Mon, Mar 13, 2017 at 9:42 PM, Kenneth Knowles <k...@google.com.invalid> wrote:
> This seems like a good topic for user@ so I've moved it there (dev@ to BCC).
>
> You can get a bounded PCollection from KafkaIO via either of
> .withMaxNumRecords(<n>) or .withMaxReadTime(<duration>).
>
> Whether or not that will meet your use case would depend on more details of
> what you are computing. Periodic batch jobs are harder to get right. In
> particular, the time you stop reading and the end of a window (esp.
> sessions) are not likely to coincide, so you'll need to deal with that.
>
> Kenn
>
> On Mon, Mar 13, 2017 at 6:09 PM, Arpan Jain <jainar...@gmail.com> wrote:
>
> > Hi,
> >
> > We run multiple streaming pipelines using Cloud Dataflow that read from
> > Kafka and write to BigQuery. We don't mind a few hours' delay and are
> > thinking of avoiding the costs associated with streaming data into
> > BigQuery. Is there already support (or a future plan) for such a
> > scenario? If not, then I guess I will implement one of the following options:
> > * A BoundedSource implementation for Kafka so that we can run this in
> >   batch mode.
> > * The streaming job writes to GCS and then a BQ load job writes to
> >   BigQuery.
> >
> > Thanks!
> >
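For what it's worth, here is a rough, untested sketch of the bounded approach Kenn
described, against the Beam Java SDK. The broker, topic, table name, and the
single-column schema are placeholders for illustration, and it does not solve the
problem we care about: each run starts from whatever the consumer's auto.offset.reset
gives it rather than resuming from the previous run's offsets. Run as a bounded
pipeline, BigQueryIO should use load jobs rather than streaming inserts.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.kafka.common.serialization.LongDeserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    public class KafkaToBigQueryBatch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromKafka",
                KafkaIO.<Long, String>read()
                    .withBootstrapServers("broker-1:9092")          // placeholder
                    .withTopic("structured-logs")                   // placeholder
                    .withKeyDeserializer(LongDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class)
                    // Either bound turns the unbounded Kafka source into a bounded
                    // one, so the pipeline runs (and terminates) as a batch job.
                    .withMaxReadTime(Duration.standardHours(1))
                    // .withMaxNumRecords(10_000_000L)
                    .withoutMetadata())
            .apply("ToTableRow", ParDo.of(new DoFn<KV<Long, String>, TableRow>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // Placeholder schema: a single "line" column with the raw log line.
                c.output(new TableRow().set("line", c.element().getValue()));
              }
            }))
            .setCoder(TableRowJsonCoder.of())
            .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("my-project:logs.structured_logs")          // placeholder
                    // Assumes the table already exists, so no schema is needed here.
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run().waitUntilFinish();
      }
    }

withMaxReadTime gives a predictable job duration while withMaxNumRecords gives a
predictable batch size; either way, something outside the pipeline would still have
to track where the previous run stopped.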