By default the files are written in JSON format. You can provide a formatter to write them in Avro format instead, which will be more efficient.
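For reference, a minimal sketch (Beam Java SDK) of what providing such a format function could look like; the element type, table spec, and field names below are hypothetical placeholders, not something specified in this thread:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.KV;

public class AvroFileLoadsExample {

  // Hypothetical destination table schema: a (user_id, count) row.
  static TableSchema schema() {
    return new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("user_id").setType("STRING"),
        new TableFieldSchema().setName("count").setType("INT64")));
  }

  // FILE_LOADS write that produces Avro files instead of the default JSON files.
  static BigQueryIO.Write<KV<String, Long>> write() {
    return BigQueryIO.<KV<String, Long>>write()
        .to("my-project:my_dataset.my_table")            // placeholder table spec
        .withSchema(schema())
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withAvroFormatFunction(
            (AvroWriteRequest<KV<String, Long>> req) ->
                // req.getSchema() is the Avro schema Beam derives from the table schema.
                new GenericRecordBuilder(req.getSchema())
                    .set("user_id", req.getElement().getKey())
                    .set("count", req.getElement().getValue())
                    .build());
  }
}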
The temp tables are only created if file sizes are too large for a single load into BQ (if you use an Avro formatter you might be able to reduce file size enough to avoid this). In this case, Beam will issue a copy job to copy all the temp tables to the final table.

On Wed, Oct 2, 2024 at 2:42 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

> @Reuven Lax <re...@google.com> I do see the file_upload method create tons of
> temp tables, but when does BQ load the temp tables into the final table?
>
> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <user@beam.apache.org>
> wrote:
>
>> File load does not return per-row errors (unlike the Storage API, which does).
>> Dataflow will generally retry the entire file load on error (indefinitely
>> for streaming and up to 3 times for batch). You can look at the logs to
>> find the specific error; however, it can be tricky to associate it with a
>> specific row.
>>
>> Reuven
>>
>> On Wed, Oct 2, 2024 at 1:08 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>
>>> Any best practices for error handling for file upload jobs?
>>>
>>> On Wed, Oct 2, 2024 at 1:04 PM hsy...@gmail.com <hsy...@gmail.com>
>>> wrote:
>>>
>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>> Storage API cost alone is too high for us; that's why we want to switch to
>>>> file upload.
>>>>
>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Have you checked
>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>
>>>>> Autosharding is generally recommended. If cost is the concern,
>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>
>>>>> On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We are trying to process over 150TB of data (streaming, unbounded) per day
>>>>>> and save it to BQ, and it looks like the Storage API is not economical
>>>>>> enough for us. I tried to use file upload, but somehow it doesn't work, and
>>>>>> there is not much documentation for the file upload method online. I have a few
>>>>>> questions regarding the file_upload method in streaming mode.
>>>>>> 1. How do I decide numOfFileShards? Can I still rely on
>>>>>> autosharding?
>>>>>> 2. I noticed the file loads method requires much more memory. I'm not
>>>>>> sure if the Dataflow runner keeps all the data in memory before writing it to
>>>>>> files; if so, even one minute of data is too much to keep in memory, and less
>>>>>> than one minute would exceed the API quota. Is there a way to cap the
>>>>>> memory usage, like writing data to files before triggering the file load job?
>>>>>> 3. I also noticed that if a file upload job fails, I don't
>>>>>> get the error message. What can I do to handle the error, and what is the
>>>>>> best practice for error handling with the file_upload method?
>>>>>>
>>>>>> Thanks!
>>>>>> Regards,
>>>>>> Siyuan
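For context, a hedged sketch of how the FILE_LOADS pieces discussed in this thread fit together on a streaming (unbounded) input, building on the write() transform sketched earlier; the triggering frequency and shard count are illustrative placeholders, not recommendations from the thread:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class StreamingFileLoadsExample {

  static BigQueryIO.Write<KV<String, Long>> streamingWrite() {
    return AvroFileLoadsExample.write()
        // Required for FILE_LOADS on an unbounded input: how often Beam closes the
        // current set of files and issues a BigQuery load job for them.
        .withTriggeringFrequency(Duration.standardMinutes(5))
        // Either pin the number of file shards written per trigger...
        .withNumFileShards(100)
        // ...or let the runner decide instead: .withAutoSharding()
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND);
  }
}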