Thanks for the code sample.

When I switched to bigquery_file_loads.BigQueryBatchFileLoads instead
of bigquery.WriteToBigQuery, it works OK now. I'm not sure why it doesn't
work with WriteToBigQuery, since it uses BigQueryBatchFileLoads under
the hood...
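
For reference, a minimal sketch of the workaround, not the exact code from
the gist. The field names and values are hypothetical; the dictionary keys
follow the BigQuery load job configuration (timePartitioning, clustering,
maxBadRecords); and the keyword arguments passed to BigQueryBatchFileLoads
are assumptions based on my reading of the Beam source, so they may differ
between versions.

```python
def build_additional_bq_parameters(partition_field, clustering_fields,
                                   max_bad_records):
    """Build load-job options in the shape of the BigQuery load config."""
    return {
        'timePartitioning': {'type': 'DAY', 'field': partition_field},
        'clustering': {'fields': list(clustering_fields)},
        'maxBadRecords': max_bad_records,
    }


def build_file_loads_transform(table, schema, params, gcs_temp_location=None):
    """Return the BigQueryBatchFileLoads transform used as the workaround.

    Beam is imported lazily so the helper above can be used without Beam
    installed. The keyword names below are assumptions and may vary by
    Beam version.
    """
    from apache_beam.io.gcp import bigquery_file_loads
    return bigquery_file_loads.BigQueryBatchFileLoads(
        destination=table,
        schema=schema,
        custom_gcs_temp_location=gcs_temp_location,
        additional_bq_parameters=params)


# Hypothetical field names and limit, for illustration only.
params = build_additional_bq_parameters('event_date', ['country'], 10)
```

In a pipeline this would be applied to the PCollection of row dictionaries,
for example rows | build_file_loads_transform(table, schema, params), with
table and schema being whatever the job already uses.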

Thanks for the help.
Zdenko
_______________________
 http://www.the-swamp.info



On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath <[email protected]>
wrote:

> +Pablo Estrada <[email protected]> who added this.
>
> I don't think we have tested this specific option, but I believe the
> additional BQ parameters option was added in a generic way to accept all
> additional parameters.
>
> Looking at the code, it seems like additional parameters do get passed
> through to load jobs:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>
> One thing you can try is running a BQ load job directly with the
> same set of data and options to see if the data gets loaded.
>
> Thanks,
> Cham
>
> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <[email protected]> wrote:
>
>> Greetings,
>>
>> I am using Beam 2.15 and Python 2.7.
>> I am running a batch job that loads data from CSV and uploads it to
>> BigQuery. I like that, instead of streaming to BigQuery, I can use a
>> "file load" to load the table all at once.
>>
>> In my case, there are a few "bad" records in the input (it's geo data, and
>> during manual upload BigQuery doesn't accept those as valid geography
>> records). This is easily solved by setting the maximum number of bad
>> records. If I understand correctly, WriteToBigQuery supports
>> "additional_bq_parameters", but for some reason, when running a pipeline on
>> the Dataflow runner, it looks like those settings are ignored.
>>
>> I played with an example from the documentation
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>>  with
>> gist file
>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>> where the table should be created partitioned on a field and clustered,
>> but when running on Dataflow that doesn't happen.
>> When I run on DirectRunner it works as expected. Interestingly, when I
>> add the maxBadRecords parameter to additional_bq_parameters, DirectRunner
>> complains that it doesn't recognize that option.
>>
>> This is my first time using this setup/combination, so I'm just wondering
>> if I overlooked something. I would appreciate any help.
>>
>> Best regards,
>> Zdenko
>>
>>
>> _______________________
>>  http://www.the-swamp.info
>>
>>
