Thanks, but will the data file be in proto format or JSON format?

On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <user@beam.apache.org> wrote:
> File load does not return per-row errors (unlike the Storage API, which
> does). Dataflow will generally retry the entire file load on error
> (indefinitely for streaming and up to 3 times for batch). You can look at
> the logs to find the specific error, though it can be tricky to associate
> it with a specific row.
>
> Reuven
>
> On Wed, Oct 2, 2024 at 1:08 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>
>> Any best practices for error handling for a file upload job?
>>
>> On Wed, Oct 2, 2024 at 1:04 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>
>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>> Storage API cost alone is too high for us; that's why we want to switch
>>> to file upload.
>>>
>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org>
>>> wrote:
>>>
>>>> Have you checked
>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>
>>>> Autosharding is generally recommended. If cost is the concern, have
>>>> you checked STORAGE_API_AT_LEAST_ONCE?
>>>>
>>>> On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com>
>>>> wrote:
>>>>
>>>>> We are trying to process over 150 TB of data per day (streaming,
>>>>> unbounded) and save it to BQ, and it looks like the Storage API is
>>>>> not economical enough for us. I tried to use file upload, but somehow
>>>>> it doesn't work, and there is not much documentation for the file
>>>>> upload method online. I have a few questions regarding the FILE_LOADS
>>>>> method in streaming mode.
>>>>>
>>>>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>>>>> 2. I noticed the FILE_LOADS method requires much more memory. I'm not
>>>>> sure whether the Dataflow runner keeps all the data in memory before
>>>>> writing it to files. If so, even one minute of data is too much to
>>>>> keep in memory, and triggering more often than once a minute would
>>>>> exceed the API quota. Is there a way to cap the memory usage, e.g.
>>>>> write the data to files before triggering the file load job?
>>>>> 3. I also noticed that if a file upload job fails, I don't get the
>>>>> error message. What can I do to handle the error, and what is the
>>>>> best practice for error handling with the FILE_LOADS method?
>>>>>
>>>>> Thanks!
>>>>> Regards,
>>>>> Siyuan
>>>>>
>>>>
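For reference, a minimal sketch (Beam Java SDK) of the streaming FILE_LOADS configuration the thread is discussing. The table name is hypothetical, the destination table is assumed to already exist, and the five-minute triggering frequency is an assumption to be tuned against the load-job quota; for unbounded input either withAutoSharding() or a fixed withNumFileShards(n) must be set alongside withTriggeringFrequency().

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class FileLoadsSketch {
  // Writes an unbounded PCollection to BigQuery via periodic batch load jobs.
  static void writeViaFileLoads(PCollection<TableRow> rows) {
    rows.apply(
        "WriteViaFileLoads",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // hypothetical table
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Required for unbounded input: how often a load job is kicked
            // off. Longer intervals mean fewer load jobs (easier on quota)
            // but more data buffered per pane before the load runs.
            .withTriggeringFrequency(Duration.standardMinutes(5))
            // Let the runner choose the shard count dynamically; to fix it
            // yourself, use .withNumFileShards(n) instead (the two options
            // are mutually exclusive).
            .withAutoSharding()
            // Assumes the destination table already exists.
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}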
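And for contrast with Reuven's point about per-row errors: a sketch, under the same hypothetical table name, of the dead-letter handling the Storage Write API path offers via WriteResult.getFailedStorageApiInserts(), which the FILE_LOADS path does not. In production the failed rows would typically go to a dead-letter table rather than a log.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryStorageApiInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

class StorageApiErrorsSketch {
  // Writes via the Storage Write API and surfaces per-row failures.
  static void writeWithDeadLetter(PCollection<TableRow> rows) {
    WriteResult result =
        rows.apply(
            "WriteViaStorageApi",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // hypothetical table
                .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
                .withCreateDisposition(CreateDisposition.CREATE_NEVER));

    // Each failed row arrives paired with its error message.
    result
        .getFailedStorageApiInserts()
        .apply(
            "LogFailedRows",
            ParDo.of(
                new DoFn<BigQueryStorageApiInsertError, Void>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    BigQueryStorageApiInsertError err = c.element();
                    System.err.println(
                        "Failed row: " + err.getRow()
                            + " error: " + err.getErrorMessage());
                  }
                }));
  }
}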