The current BQ source is a bounded source: you are reading a snapshot of a BQ table at a given point in time. It's possible to use the BQ source (or any other bounded source) from a streaming pipeline, but doing so automatically invokes a bounded-to-unbounded converter. That converter produces a bare-bones unbounded source that might not scale well, as you noticed.
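
If the job does not actually need streaming semantics, one possibility is to run the read as a batch pipeline, so the bounded BQ source is used directly rather than through the converter. Below is a minimal sketch using the Beam Python SDK. The helper name `batch_pipeline_args`, the function `run`, and the project, bucket, and table names in the usage comment are hypothetical placeholders. Whether this improves throughput or autoscaling for your table is an assumption to verify, not a guarantee.

```python
from typing import List


def batch_pipeline_args(project: str, region: str, temp_location: str,
                        max_workers: int) -> List[str]:
    """Build Dataflow flags for a batch job (hypothetical helper).

    The --streaming flag is deliberately left out, so the bounded
    BigQuery source is not wrapped by the bounded-to-unbounded converter.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--max_num_workers={max_workers}",
    ]


def run(table_spec: str, args: List[str]) -> None:
    """Read a snapshot of the table in batch mode and print its row count."""
    # Imported here so the flag helper above works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(args)) as p:
        (
            p
            # The default read method is EXPORT, matching the original setup.
            | "ReadSnapshot" >> beam.io.ReadFromBigQuery(table=table_spec)
            | "CountRows" >> beam.combiners.Count.Globally()
            | "Print" >> beam.Map(print)
        )


# Example invocation (placeholder names, requires GCP credentials):
# run("my-project:my_dataset.my_table",
#     batch_pipeline_args("my-project", "us-central1",
#                         "gs://my-bucket/tmp", 1000))
```

The pipeline body is otherwise unchanged from a streaming job. The only difference in this sketch is that the streaming flag is absent.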

- Cham

On Thu, Apr 23, 2020 at 6:39 AM Aniruddh Sharma <[email protected]> wrote:

> Adding the subject line.
>
> On 2020/04/23 13:38:16, Aniruddh Sharma <[email protected]> wrote:
> > Hello
> >
> > I want to read a BQ table which has billions of rows. I am using
> > Streaming mode and using EXPORT method.
> >
> > Read is running very slow (seems like in batches) and my job is super
> > slow. Intent of this query is to find what different settings can be
> > applied to maximize the read throughput from BQ.
> >
> > a) I notice in BigQueryOptions there are some options to control the
> > concurrency of Writes in BQ, but don't find any such options in READ. Can
> > there be some settings either in DF or BQ to say to read more data and in
> > parallel in BQ.
> >
> > b) I start from numWorkers=10 and maxWorkers=1000, and it constantly
> > runs on 10 workers, Dataflow does not apply autoscaling, somehow it does
> > not determine that it can spin up to 1000 workers and have billion of rows
> > pending to be read and it can spin more machines and read. It doesn't do
> > that.
> >
> > Any guidance will help.
> >
> > Thanks
> > Aniruddh
