Is this a batch or streaming job?

On Tue, Oct 8, 2024 at 10:25 AM hsy...@gmail.com <hsy...@gmail.com> wrote:
> It looks like the COPY job failed because the TEMP table was removed.
> @Reuven Lax <re...@google.com> Is that possible? Is there a way to avoid
> that? Or even better, is there a way to force writing to the destination
> table directly? Thanks!
>
> On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <re...@google.com> wrote:
>
>> By default the file is in JSON format. You can provide a formatter to
>> allow it to be in AVRO format instead, which will be more efficient.
>>
>> The temp tables are only created if file sizes are too large for a single
>> load into BQ (if you use an AVRO formatter you might be able to reduce
>> file size enough to avoid this). In this case, Beam will issue a copy job
>> to copy all the temp tables to the final table.
>>
>> On Wed, Oct 2, 2024 at 2:42 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>
>>> @Reuven Lax <re...@google.com> I do see file_upload create tons of
>>> temp tables, but when does BQ load the temp tables into the final table?
>>>
>>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <user@beam.apache.org>
>>> wrote:
>>>
>>>> File load does not return per-row errors (unlike the Storage API, which
>>>> does). Dataflow will generally retry the entire file load on error
>>>> (indefinitely for streaming and up to 3 times for batch). You can look
>>>> at the logs to find the specific error, however it can be tricky to
>>>> associate it with a specific row.
>>>>
>>>> Reuven
>>>>
>>>> On Wed, Oct 2, 2024 at 1:08 PM hsy...@gmail.com <hsy...@gmail.com>
>>>> wrote:
>>>>
>>>>> Any best practices for error handling for a file upload job?
>>>>>
>>>>> On Wed, Oct 2, 2024 at 1:04 PM hsy...@gmail.com <hsy...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>>>> Storage API cost alone is too high for us; that's why we want to
>>>>>> switch to file upload.
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Have you checked
>>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>>
>>>>>>> Autosharding is generally recommended. If the cost is the concern,
>>>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>>>
>>>>>>> On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We are trying to process over 150TB of data (streaming, unbounded)
>>>>>>>> per day and save it to BQ, and it looks like the Storage API is not
>>>>>>>> economical enough for us. I tried to use file upload but somehow it
>>>>>>>> doesn't work, and there are not many documents for the file upload
>>>>>>>> method online. I have a few questions regarding the file_upload
>>>>>>>> method in streaming mode.
>>>>>>>> 1. How do I decide numOfFileShards? Can I still rely on
>>>>>>>> autosharding?
>>>>>>>> 2. I noticed the fileloads method requires much more memory. I'm
>>>>>>>> not sure if the Dataflow runner keeps all the data in memory before
>>>>>>>> writing to file. If so, even one minute of data is too much to keep
>>>>>>>> in memory, and less than one minute would exceed the API quota. Is
>>>>>>>> there a way to cap the memory usage, like writing data to files
>>>>>>>> before triggering the file load job?
>>>>>>>> 3. I also noticed that if there is a file upload job failure, I
>>>>>>>> don't get the error message, so what can I do to handle the error?
>>>>>>>> What is the best practice in terms of error handling in the
>>>>>>>> file_upload method?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Regards,
>>>>>>>> Siyuan
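
For reference, below is a minimal sketch (Beam Java SDK) of the FILE_LOADS configuration discussed in this thread. The method name, PCollection "rows", the destination table string, the schema, and the shard count / triggering frequency are placeholders for illustration and should be tuned for your pipeline; they are not recommended values.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class FileLoadsSketch {
      // "rows" stands in for the unbounded PCollection<TableRow> produced upstream
      // (for example, read from Pub/Sub and converted to TableRow).
      static void writeViaFileLoads(PCollection<TableRow> rows, TableSchema schema) {
        rows.apply(
            "WriteViaFileLoads",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // placeholder destination
                .withSchema(schema)
                .withMethod(Method.FILE_LOADS)
                // Streaming FILE_LOADS needs a triggering frequency; each firing
                // writes files to GCS and issues one batch load job.
                .withTriggeringFrequency(Duration.standardMinutes(5))
                // Either pin the number of file shards explicitly...
                .withNumFileShards(100)
                // ...or let the runner choose (numFileShards is then ignored):
                // .withAutoSharding()
                // To stage Avro files instead of JSON (as suggested above), supply an
                // Avro format function, e.g.
                // .withAvroFormatFunction(req -> toGenericRecord(req.getElement(), req.getSchema()))
                // where toGenericRecord is a hypothetical TableRow -> GenericRecord converter.
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
      }
    }

Per the discussion above, each triggering-frequency firing produces one load job; temp tables and a follow-up copy job only come into play when the staged files are too large for a single load into the destination table.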