This seems like a good topic for user@, so I've moved it there (dev@ to BCC).
You can get a bounded PCollection from KafkaIO via either .withMaxNumRecords(<n>) or .withMaxReadTime(<duration>). Whether that meets your use case depends on the details of what you are computing. Periodic batch jobs are harder to get right: in particular, the time you stop reading and the end of a window (especially a session window) are unlikely to coincide, so you'll need to deal with that. (A minimal sketch of a bounded read follows the quoted message below.)

Kenn

On Mon, Mar 13, 2017 at 6:09 PM, Arpan Jain <[email protected]> wrote:
> Hi,
>
> We run multiple streaming pipelines on Cloud Dataflow that read from
> Kafka and write to BigQuery. We don't mind a few hours' delay and are
> thinking of avoiding the costs associated with streaming data into
> BigQuery. Is there existing support (or a future plan) for such a
> scenario? If not, then I guess I will implement one of the following
> options:
> * A BoundedSource implementation for Kafka so that we can run this in
> batch mode.
> * The streaming job writes to GCS, and then a BQ load job writes to
> BigQuery.
>
> Thanks!
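For concreteness, here is a minimal sketch of such a bounded read. It assumes
a Beam release whose KafkaIO uses the Deserializer-based API; the broker
address, topic name, and limits are placeholders, not recommendations.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class BoundedKafkaRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")   // placeholder broker address
        .withTopic("events")                     // placeholder topic name
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // Either cap alone makes the source bounded; with both set, the
        // read stops at whichever limit is hit first.
        .withMaxNumRecords(1_000_000)
        .withMaxReadTime(Duration.standardHours(1))
        .withoutMetadata());                     // yields KV<String, String>

    // Because the source is bounded, a runner such as Dataflow executes
    // this as a batch job rather than a streaming one.
    p.run().waitUntilFinish();
  }
}

Note that a cap like this stops the read at an arbitrary point in event time,
which is exactly why the window-boundary caveat above matters.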
