This is my Dataflow job ID: 2018-01-15_03_45_25-16237456828769339983

On Wed, Jan 17, 2018 at 12:10 AM, Eugene Kirpichov <[email protected]>
wrote:
> After the BigQuery export job, Beam also needs to ingest the results of
> the export (Avro files); Beam uses the standard Python avro library for
> that, which has notoriously awful performance; taking 2 hours to read
> 44 GB of data would not be surprising. It's possible that this is the
> root of the issue, but we'd need a job ID to tell for sure. There are
> several forks of the avro library with better performance, but the forks
> are unmaintained.
>
> On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <[email protected]>
> wrote:
>
>> The BQ Read step involves running a query job in BQ followed by an
>> export job to export the resulting table to GCS. Is it possible that
>> these jobs took a long time for some reason?
>>
>> The Dataflow job should log the BQ job IDs of these jobs, and it should
>> be possible to check their status using the following command:
>>
>> bq show -j --project_id=<GCP project ID> <BQ job ID>
>>
>> Feel free to mention your job ID on the Dataflow SDK's Stack Overflow
>> channel if you want the Dataflow team to take a look:
>> https://stackoverflow.com/questions/tagged/dataflow
>>
>> - Cham
>>
>> On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
>> [email protected]> wrote:
>>
>>> I'm reading a date-sharded table from BigQuery (180 days, ~44.26 GB)
>>> using beam.io.BigQuerySource() by running a simple query:
>>>
>>> """
>>> SELECT
>>>   filed1,
>>>   filed2,
>>>   filed3,
>>>   filed4,
>>>   filed5,
>>>   filed6
>>> FROM
>>>   TABLE_DATE_RANGE([dataset:table_name_], TIMESTAMP('{start_date}'),
>>>   TIMESTAMP('{end_date}'))
>>> WHERE
>>>   filed1 IS NOT NULL
>>> """
>>>
>>> After that, I partition the source data based on the field2 date and
>>> convert it into date-partitioned PCollections.
>>>
>>> While monitoring the Dataflow console, I noticed that the BQ Read step
>>> takes more than 1 hr 40 min of the 2 hr 54 min total execution time.
>>>
>>> Why is the BQ I/O read taking so long? Is there any method in Dataflow
>>> (I'm using the Python API) to speed up this process?
>>>
>>> How can I reduce the read I/O execution time?
>>>
>>> A screenshot of the graph is attached. (The time shown on the graph is
>>> wrong; it took 2 hr 54 min to finish.)
>>>
>>> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
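
For concreteness, here is a minimal sketch of the pipeline shape described
in the original question, written against the Beam Python SDK of that era.
The field names, date range, and partition function are hypothetical
stand-ins, not the actual code from the job above:

    import datetime
    import apache_beam as beam

    NUM_DAYS = 180
    START = datetime.date(2017, 7, 1)  # hypothetical start of the window

    # Hypothetical legacy-SQL query in the same shape as the one above.
    QUERY = """
    SELECT field1, field2
    FROM TABLE_DATE_RANGE([dataset:table_name_],
                          TIMESTAMP('2017-07-01'), TIMESTAMP('2018-01-01'))
    WHERE field1 IS NOT NULL
    """

    def by_day(row, num_partitions):
        # BigQuerySource yields each row as a dict; this sketch assumes
        # 'field2' holds a 'YYYY-MM-DD' date string.
        day = datetime.datetime.strptime(row['field2'], '%Y-%m-%d').date()
        return max(0, min((day - START).days, num_partitions - 1))

    with beam.Pipeline() as p:
        rows = p | 'ReadFromBQ' >> beam.io.Read(
            beam.io.BigQuerySource(query=QUERY))
        # Partition returns a tuple of NUM_DAYS PCollections, one per day.
        daily = rows | 'PartitionByDay' >> beam.Partition(by_day, NUM_DAYS)

beam.Partition applies the function to every element and hands back one
PCollection per day, so each day's data can then be processed or written
independently; the partitioning itself is cheap compared to the read.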

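On the avro point: a quick way to check whether Avro decoding dominates the
read time is to copy one exported shard locally and time the standard avro
reader against fastavro (a separate C-accelerated reader, mentioned here
only as a point of comparison, not necessarily one of the forks referenced
above). The file path below is hypothetical:

    import time

    import avro.datafile
    import avro.io
    import fastavro

    # Hypothetical path to one shard of the BigQuery export, copied locally.
    PATH = 'export-00000-of-00042.avro'

    # Standard avro package (what the Python SDK used at the time).
    start = time.time()
    with open(PATH, 'rb') as f:
        n = sum(1 for _ in avro.datafile.DataFileReader(f, avro.io.DatumReader()))
    print('avro:     %d records in %.1fs' % (n, time.time() - start))

    # fastavro, for comparison.
    start = time.time()
    with open(PATH, 'rb') as f:
        n = sum(1 for _ in fastavro.reader(f))
    print('fastavro: %d records in %.1fs' % (n, time.time() - start))

If the gap is large, Avro decoding, rather than the BigQuery query or
export jobs, is the likely bottleneck in the read step.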