Is this a batch or streaming job?

On Tue, Oct 8, 2024 at 10:25 AM hsy...@gmail.com <hsy...@gmail.com> wrote:
> It looks like the COPY job failed because the TEMP table was removed.
> @Reuven Lax <re...@google.com> Is that possible? Is there a way to avoid
> that? Or even better, is there a way to force writing to the destination
> table directly? Thanks!
>
> On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <re...@google.com> wrote:
>
>> By default the file is in JSON format. You can provide a formatter to
>> allow it to be in AVRO format instead, which will be more efficient.
>>
>> The temp tables are only created if file sizes are too large for a single
>> load into BQ (if you use an AVRO formatter you might be able to reduce
>> file size enough to avoid this). In this case, Beam will issue a copy job
>> to copy all the temp tables to the final table.
>>
>> On Wed, Oct 2, 2024 at 2:42 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>
>>> @Reuven Lax <re...@google.com> I do see file_upload create tons of
>>> temp tables, but when does BQ load the temp tables into the final table?
>>>
>>> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <user@beam.apache.org>
>>> wrote:
>>>
>>>> File load does not return per-row errors (unlike the Storage API, which
>>>> does). Dataflow will generally retry the entire file load on error
>>>> (indefinitely for streaming and up to 3 times for batch). You can look
>>>> at the logs to find the specific error, however it can be tricky to
>>>> associate it with a specific row.
>>>>
>>>> Reuven
>>>>
>>>> On Wed, Oct 2, 2024 at 1:08 PM hsy...@gmail.com <hsy...@gmail.com>
>>>> wrote:
>>>>
>>>>> Any best practices for error handling for a file upload job?
>>>>>
>>>>> On Wed, Oct 2, 2024 at 1:04 PM hsy...@gmail.com <hsy...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>>>> Storage API cost alone is too high for us; that's why we want to
>>>>>> switch to file upload.
>>>>>>
>>>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Have you checked
>>>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>>>
>>>>>>> Autosharding is generally recommended. If the cost is the concern,
>>>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>>>
>>>>>>> On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We are trying to process over 150TB of data (streaming, unbounded)
>>>>>>>> per day and save it to BQ, and it looks like the Storage API is not
>>>>>>>> economical enough for us. I tried to use file upload but somehow it
>>>>>>>> doesn't work, and there are not many documents for the file upload
>>>>>>>> method online. I have a few questions regarding the file_upload
>>>>>>>> method in streaming mode.
>>>>>>>> 1. How do I decide numOfFileShards? Can I still rely on
>>>>>>>> autosharding?
>>>>>>>> 2. I noticed the fileloads method requires much more memory. I'm
>>>>>>>> not sure if the Dataflow runner keeps all the data in memory before
>>>>>>>> writing to file. If so, even one minute of data is too much to keep
>>>>>>>> in memory, and less than one minute would exceed the API quota. Is
>>>>>>>> there a way to cap the memory usage, like writing data to files
>>>>>>>> before triggering the file load job?
>>>>>>>> 3. I also noticed that if there is a file upload job failure, I
>>>>>>>> don't get the error message, so what can I do to handle the error?
>>>>>>>> What is the best practice in terms of error handling in the
>>>>>>>> file_upload method?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Regards,
>>>>>>>> Siyuan
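
For reference, below is a minimal sketch (Beam Java SDK) of the FILE_LOADS configuration discussed in this thread. The method name, PCollection "rows", the destination table string, the schema, and the shard count / triggering frequency are placeholders for illustration and should be tuned for your pipeline; they are not recommended values.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class FileLoadsSketch {
      // "rows" stands in for the unbounded PCollection<TableRow> produced upstream
      // (for example, read from Pub/Sub and converted to TableRow).
      static void writeViaFileLoads(PCollection<TableRow> rows, TableSchema schema) {
        rows.apply(
            "WriteViaFileLoads",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // placeholder destination
                .withSchema(schema)
                .withMethod(Method.FILE_LOADS)
                // Streaming FILE_LOADS needs a triggering frequency; each firing
                // writes files to GCS and issues one batch load job.
                .withTriggeringFrequency(Duration.standardMinutes(5))
                // Either pin the number of file shards explicitly...
                .withNumFileShards(100)
                // ...or let the runner choose (numFileShards is then ignored):
                // .withAutoSharding()
                // To stage Avro files instead of JSON (as suggested above), supply an
                // Avro format function, e.g.
                // .withAvroFormatFunction(req -> toGenericRecord(req.getElement(), req.getSchema()))
                // where toGenericRecord is a hypothetical TableRow -> GenericRecord converter.
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));
      }
    }

Per the discussion above, each triggering-frequency firing produces one load job; temp tables and a follow-up copy job only come into play when the staged files are too large for a single load into the destination table.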