Yes, it is using the Dataflow runner. I'll give Avro a try.
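(For reference, a minimal sketch of what the Avro-based file-loads write discussed below might look like in the Beam Java SDK. MyEvent, tableSchema, and the toGenericRecord(...) helper are placeholders, not something defined in this thread.)

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    PCollection<MyEvent> events = ...;  // placeholder input

    events.apply(
        "WriteToBigQuery",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // On an unbounded input, FILE_LOADS needs a triggering frequency
            // plus either a fixed shard count or auto-sharding.
            .withTriggeringFrequency(Duration.standardMinutes(10))
            .withAutoSharding()
            // Write Avro files instead of the default JSON files.
            .withAvroFormatFunction(
                (AvroWriteRequest<MyEvent> req) ->
                    toGenericRecord(req.getElement(), req.getSchema()))
            .useAvroLogicalTypes()
            .withSchema(tableSchema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));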
On Tue, Oct 8, 2024 at 11:02 AM Reuven Lax <re...@google.com> wrote:

I would try to use AVRO if possible - it tends to decrease the file size by quite a lot, and might get you under the limit for a single load job, which is 11 TB or 10,000 files (depending on the frequency at which you are triggering the loads). JSON tends to blow up the data size quite a bit.

BTW - is this using the Dataflow runner? If so, Beam should never delete the temp tables until the copy job is completed.

On Tue, Oct 8, 2024 at 10:49 AM hsy...@gmail.com <hsy...@gmail.com> wrote:

It is a streaming job.

On Tue, Oct 8, 2024 at 10:40 AM Reuven Lax <re...@google.com> wrote:

Is this a batch or streaming job?

On Tue, Oct 8, 2024 at 10:25 AM hsy...@gmail.com <hsy...@gmail.com> wrote:

It looks like the COPY job failed because the TEMP table was removed. @Reuven Lax <re...@google.com> Is that possible? Is there a way to avoid that? Or, even better, is there a way to force writing to the destination table directly? Thanks!

On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <re...@google.com> wrote:

By default the files are in JSON format. You can provide a formatter to write them in AVRO format instead, which will be more efficient.

The temp tables are only created if the file sizes are too large for a single load into BQ (if you use an AVRO formatter you might be able to reduce the file size enough to avoid this). In that case, Beam will issue a copy job to copy all the temp tables to the final table.

On Wed, Oct 2, 2024 at 2:42 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

@Reuven Lax <re...@google.com> I do see the file upload method create tons of temp tables, but when does BQ load the temp tables into the final table?

On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <user@beam.apache.org> wrote:

File loads do not return per-row errors (unlike the Storage API, which does). Dataflow will generally retry the entire file load on error (indefinitely for streaming, and up to 3 times for batch). You can look at the logs to find the specific error; however, it can be tricky to associate it with a specific row.

Reuven

On Wed, Oct 2, 2024 at 1:08 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

Any best practices for error handling for a file upload job?

On Wed, Oct 2, 2024 at 1:04 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the Storage API cost alone is too high for us; that's why we want to switch to file upload.

On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org> wrote:

Have you checked https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?

Autosharding is generally recommended. If cost is the concern, have you checked STORAGE_API_AT_LEAST_ONCE?
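(For context, the STORAGE_API_AT_LEAST_ONCE option mentioned above is just a different Method value on the same BigQueryIO write. A minimal sketch, assuming a PCollection<TableRow> named rows and placeholder table and schema names; it avoids the exactly-once bookkeeping that drives up Dataflow cost, but BigQuery still bills the Storage Write API ingestion itself, which is the cost concern raised elsewhere in the thread.)

    rows.apply(
        "WriteAtLeastOnce",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            // At-least-once Storage Write API: lower Dataflow overhead than
            // STORAGE_WRITE_API, but rows may be duplicated on retries.
            .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
            .withSchema(tableSchema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));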
On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

We are trying to process over 150 TB of data per day (streaming, unbounded) and save it to BQ, and it looks like the Storage API is not economical enough for us. I tried to use file upload but somehow it doesn't work, and there is not much documentation on the file upload method online. I have a few questions regarding the file_upload method in streaming mode:

1. How do I decide numOfFileShards? Can I still rely on autosharding?

2. I noticed the file loads method requires much more memory. I'm not sure whether the Dataflow runner keeps all the data in memory before writing it to files? If so, even one minute of data is too much to keep in memory, and triggering more often than once a minute would exceed the API quota. Is there a way to cap the memory usage, e.g., write the data to files before triggering the file load job?

3. I also noticed that if a file upload job fails, I don't get the error message, so what can I do to handle the error? What is the best practice for error handling with the file_upload method?

Thanks!
Regards,
Siyuan
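(On questions 1 and 2, a small sketch of the triggering and sharding knobs on a streaming FILE_LOADS write, reusing the placeholder names from the Avro sketch near the top of the thread. The triggering frequency controls how often load jobs are issued; the shard count can either be pinned with withNumFileShards or left to the runner with withAutoSharding, the autosharding recommended in the Dataflow docs linked above.)

    BigQueryIO.Write<MyEvent> base =
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // How often load (and, if needed, temp-table copy) jobs are issued.
            .withTriggeringFrequency(Duration.standardMinutes(5))
            .withAvroFormatFunction(
                req -> toGenericRecord(req.getElement(), req.getSchema()))
            .withSchema(tableSchema);

    // Question 1, option A: pin the number of file shards explicitly.
    events.apply("WriteFixedShards", base.withNumFileShards(100));

    // Question 1, option B (instead of A): let the runner choose and adjust
    // the shard count dynamically.
    // events.apply("WriteAutoSharded", base.withAutoSharding());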