After the bigquery export job, Beam needs to also ingest the results of the export (Avro files); Beam uses the standard Python avro library for that, which has notoriously awful performance; taking 2 hours to read 44GB of data would not be surprising. It's possible that this is the root of the issue, but we'd need a job id to tell for sure. There are several forks of the avro library with better performance but the forks are unmaintained.
On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <[email protected]> wrote: > BQ Read step involves running a query job in BQ followed by running a > export job to export the resulting table to GCS. Is it possible that these > jobs took a long time for some reason ? > > Dataflow job should log the BQ job IDs of these jobs and it should be > possible to check the status using following command. > > bq show -j --project_id=<GCP project ID> <BQ job ID> > > Feel free to mention your job ID in Dataflow SDK's stackoverflow channel > if you want Dataflow team to take a look. > https://stackoverflow.com/questions/tagged/dataflow > > - Cham > > > > On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil < > [email protected]> wrote: > >> >> I'm reading a date sharded table from Bigquery (180 days ~ 44.26GB) using >> beam.io.BigQuerySource() by running a simple query >> >> """ >> SELECT >> filed1, >> filed2, >> filed3, >> filed4, >> filed5, >> filed6 >> FROM >> TABLE_DATE_RANGE([dataset:table_name_], TIMESTAMP('{start_date}'), >> TIMESTAMP('{end_date}')) >> WHERE >> filed1 IS NOT NULL >> """ >> >> after that, I'm partitioning the source data based on field2 date and >> converting to date partitioned P-Collections >> >> But while monitoring the data flow console I noticed that the BQRead >> operation taking more than 1hr 40min out of 2hr: 54-minute total execution. >> >> Why the BQ io read taking a long time? Is there any implemented method in >> data flow (I'm using python API) to speed up this process. >> >> How I can reduce the read io execution time?. >> >> Screenshot of graph is attached (Time showed on the graph is wrong - It >> took 2hr 54-min to finish) >> >> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png] >> >> >
