This is my Dataflow job ID: 2018-01-15_03_45_25-16237456828769339983

On Wed, Jan 17, 2018 at 12:10 AM, Eugene Kirpichov <[email protected]>
wrote:
> After the BigQuery export job, Beam also needs to ingest the results of
> the export (Avro files); Beam uses the standard Python avro library for
> that, which has notoriously awful performance; taking 2 hours to read
> 44 GB of data would not be surprising. It's possible that this is the
> root of the issue, but we'd need a job ID to tell for sure. There are
> several forks of the avro library with better performance, but the forks
> are unmaintained.
>
> On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <[email protected]>
> wrote:
>
>> The BQ Read step involves running a query job in BQ followed by an
>> export job to export the resulting table to GCS. Is it possible that
>> these jobs took a long time for some reason?
>>
>> The Dataflow job should log the BQ job IDs of these jobs, and it should
>> be possible to check their status using the following command:
>>
>> bq show -j --project_id=<GCP project ID> <BQ job ID>
>>
>> Feel free to mention your job ID on the Dataflow SDK's Stack Overflow
>> channel if you want the Dataflow team to take a look:
>> https://stackoverflow.com/questions/tagged/dataflow
>>
>> - Cham
>>
>> On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
>> [email protected]> wrote:
>>
>>> I'm reading a date-sharded table from BigQuery (180 days, ~44.26 GB)
>>> using beam.io.BigQuerySource() by running a simple query:
>>>
>>> """
>>> SELECT
>>>   filed1,
>>>   filed2,
>>>   filed3,
>>>   filed4,
>>>   filed5,
>>>   filed6
>>> FROM
>>>   TABLE_DATE_RANGE([dataset:table_name_], TIMESTAMP('{start_date}'),
>>>   TIMESTAMP('{end_date}'))
>>> WHERE
>>>   filed1 IS NOT NULL
>>> """
>>>
>>> After that, I partition the source data based on the field2 date and
>>> convert it into date-partitioned PCollections.
>>>
>>> While monitoring the Dataflow console, I noticed that the BQ Read step
>>> takes more than 1 hr 40 min of the 2 hr 54 min total execution time.
>>>
>>> Why is the BQ I/O read taking so long? Is there any method in Dataflow
>>> (I'm using the Python API) to speed up this process?
>>>
>>> How can I reduce the read I/O execution time?
>>>
>>> A screenshot of the graph is attached. (The time shown on the graph is
>>> wrong; it took 2 hr 54 min to finish.)
>>>
>>> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
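
For concreteness, here is a minimal sketch of the pipeline shape described
in the original question, written against the Beam Python SDK of that era.
The field names, date range, and partition function are hypothetical
stand-ins, not the actual code from the job above:

    import datetime
    import apache_beam as beam

    NUM_DAYS = 180
    START = datetime.date(2017, 7, 1)  # hypothetical start of the window

    # Hypothetical legacy-SQL query in the same shape as the one above.
    QUERY = """
    SELECT field1, field2
    FROM TABLE_DATE_RANGE([dataset:table_name_],
                          TIMESTAMP('2017-07-01'), TIMESTAMP('2018-01-01'))
    WHERE field1 IS NOT NULL
    """

    def by_day(row, num_partitions):
        # BigQuerySource yields each row as a dict; this sketch assumes
        # 'field2' holds a 'YYYY-MM-DD' date string.
        day = datetime.datetime.strptime(row['field2'], '%Y-%m-%d').date()
        return max(0, min((day - START).days, num_partitions - 1))

    with beam.Pipeline() as p:
        rows = p | 'ReadFromBQ' >> beam.io.Read(
            beam.io.BigQuerySource(query=QUERY))
        # Partition returns a tuple of NUM_DAYS PCollections, one per day.
        daily = rows | 'PartitionByDay' >> beam.Partition(by_day, NUM_DAYS)

beam.Partition applies the function to every element and hands back one
PCollection per day, so each day's data can then be processed or written
independently; the partitioning itself is cheap compared to the read.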

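On the avro point: a quick way to check whether Avro decoding dominates the
read time is to copy one exported shard locally and time the standard avro
reader against fastavro (a separate C-accelerated reader, mentioned here
only as a point of comparison, not necessarily one of the forks referenced
above). The file path below is hypothetical:

    import time

    import avro.datafile
    import avro.io
    import fastavro

    # Hypothetical path to one shard of the BigQuery export, copied locally.
    PATH = 'export-00000-of-00042.avro'

    # Standard avro package (what the Python SDK used at the time).
    start = time.time()
    with open(PATH, 'rb') as f:
        n = sum(1 for _ in avro.datafile.DataFileReader(f, avro.io.DatumReader()))
    print('avro:     %d records in %.1fs' % (n, time.time() - start))

    # fastavro, for comparison.
    start = time.time()
    with open(PATH, 'rb') as f:
        n = sum(1 for _ in fastavro.reader(f))
    print('fastavro: %d records in %.1fs' % (n, time.time() - start))

If the gap is large, Avro decoding, rather than the BigQuery query or
export jobs, is the likely bottleneck in the read step.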