After the BigQuery export job finishes, Beam also has to ingest the
exported results (Avro files). Beam uses the standard Python avro library
for that, which has notoriously awful performance; taking 2 hours to read
44GB of data would not be surprising. It's possible that this is the root
of the issue, but we'd need a job ID to tell for sure. There are several
forks of the avro library with better performance, but they are
unmaintained.
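
For a rough sense of scale, here is a back-of-the-envelope sketch of the
average read rate implied by the numbers in this thread. It assumes the
44.26GB figure is in GiB and that the whole 1 hr 40 min BQRead step was spent
reading, which overstates the read time since that step also covers the query
and export jobs. The exported Avro files may also differ in size from the
table. The worker count is hypothetical; the thread doesn't give one.

```python
def read_throughput_mib_per_s(size_gib: float, seconds: float) -> float:
    """Average read rate in MiB/s for `size_gib` GiB read in `seconds`."""
    return size_gib * 1024 / seconds


# Figures quoted in the thread: ~44.26 GB table, BQRead step > 1 hr 40 min.
table_gib = 44.26
read_seconds = (1 * 60 + 40) * 60  # 1 hr 40 min expressed in seconds

aggregate = read_throughput_mib_per_s(table_gib, read_seconds)
print(f"aggregate: {aggregate:.1f} MiB/s")

# Hypothetical worker count, used only to illustrate the per-worker rate.
assumed_workers = 4
per_worker = aggregate / assumed_workers
print(f"per worker (assuming {assumed_workers} workers): {per_worker:.1f} MiB/s")
```

A per-worker rate of a few MiB/s would point at decoding cost rather than
I/O, but the BQ job timings are still needed to confirm that.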

On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <[email protected]>
wrote:

> The BQ Read step involves running a query job in BQ, followed by running an
> export job to export the resulting table to GCS. Is it possible that these
> jobs took a long time for some reason?
>
> The Dataflow job should log the BQ job IDs of these jobs, and it should be
> possible to check their status using the following command.
>
> bq show -j --project_id=<GCP project ID> <BQ job ID>
>
> Feel free to mention your job ID in the Dataflow SDK's Stack Overflow
> channel if you want the Dataflow team to take a look.
> https://stackoverflow.com/questions/tagged/dataflow
>
> - Cham
>
>
>
> On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
> [email protected]> wrote:
>
>>
>> I'm reading a date-sharded table from BigQuery (180 days ~ 44.26GB) using
>> beam.io.BigQuerySource() by running a simple query
>>
>>  """
>> SELECT
>>   field1,
>>   field2,
>>   field3,
>>   field4,
>>   field5,
>>   field6
>> FROM
>>   TABLE_DATE_RANGE([dataset:table_name_], TIMESTAMP('{start_date}'),
>> TIMESTAMP('{end_date}'))
>> WHERE
>>   field1 IS NOT NULL
>> """
>>
>> After that, I'm partitioning the source data by the field2 date and
>> converting it to date-partitioned PCollections.
>>
>> But while monitoring the Dataflow console, I noticed that the BQRead
>> operation takes more than 1 hr 40 min out of the 2 hr 54 min total
>> execution time.
>>
>> Why is the BQ IO read taking such a long time? Is there any method
>> implemented in Dataflow (I'm using the Python API) to speed up this process?
>>
>> How can I reduce the read IO execution time?
>>
>> A screenshot of the graph is attached (the time shown on the graph is
>> wrong; it took 2 hr 54 min to finish).
>>
>> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
>>
>
