By default the files are written in JSON format. You can provide a formatter to write them in Avro format instead, which will be more efficient.
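For reference, a minimal sketch (Beam Java SDK) of what providing such a format function could look like; the element type, table spec, and field names below are hypothetical placeholders, not something specified in this thread:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.KV;

public class AvroFileLoadsExample {

  // Hypothetical destination table schema: a (user_id, count) row.
  static TableSchema schema() {
    return new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("user_id").setType("STRING"),
        new TableFieldSchema().setName("count").setType("INT64")));
  }

  // FILE_LOADS write that produces Avro files instead of the default JSON files.
  static BigQueryIO.Write<KV<String, Long>> write() {
    return BigQueryIO.<KV<String, Long>>write()
        .to("my-project:my_dataset.my_table")            // placeholder table spec
        .withSchema(schema())
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withAvroFormatFunction(
            (AvroWriteRequest<KV<String, Long>> req) ->
                // req.getSchema() is the Avro schema Beam derives from the table schema.
                new GenericRecordBuilder(req.getSchema())
                    .set("user_id", req.getElement().getKey())
                    .set("count", req.getElement().getValue())
                    .build());
  }
}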
The temp tables are only created if file sizes are too large for a single load into BQ (if you use an Avro formatter you might be able to reduce file size enough to avoid this). In this case, Beam will issue a copy job to copy all the temp tables to the final table.

On Wed, Oct 2, 2024 at 2:42 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

> @Reuven Lax <re...@google.com> I do see the file_upload method create tons of
> temp tables, but when does BQ load the temp tables into the final table?
>
> On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <user@beam.apache.org>
> wrote:
>
>> File load does not return per-row errors (unlike the Storage API, which does).
>> Dataflow will generally retry the entire file load on error (indefinitely
>> for streaming and up to 3 times for batch). You can look at the logs to
>> find the specific error; however, it can be tricky to associate it with a
>> specific row.
>>
>> Reuven
>>
>> On Wed, Oct 2, 2024 at 1:08 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>>
>>> Any best practices for error handling for file upload jobs?
>>>
>>> On Wed, Oct 2, 2024 at 1:04 PM hsy...@gmail.com <hsy...@gmail.com>
>>> wrote:
>>>
>>>> STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the
>>>> Storage API cost alone is too high for us; that's why we want to switch to
>>>> file upload.
>>>>
>>>> On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Have you checked
>>>>> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>>>>>
>>>>> Autosharding is generally recommended. If cost is the concern,
>>>>> have you checked STORAGE_API_AT_LEAST_ONCE?
>>>>>
>>>>> On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We are trying to process over 150TB of data (streaming, unbounded) per day
>>>>>> and save it to BQ, and it looks like the Storage API is not economical
>>>>>> enough for us. I tried to use file upload, but somehow it doesn't work, and
>>>>>> there is not much documentation for the file upload method online. I have a few
>>>>>> questions regarding the file_upload method in streaming mode.
>>>>>> 1. How do I decide numOfFileShards? Can I still rely on
>>>>>> autosharding?
>>>>>> 2. I noticed the file loads method requires much more memory. I'm not
>>>>>> sure if the Dataflow runner keeps all the data in memory before writing it to
>>>>>> files; if so, even one minute of data is too much to keep in memory, and less
>>>>>> than one minute would exceed the API quota. Is there a way to cap the
>>>>>> memory usage, like writing data to files before triggering the file load job?
>>>>>> 3. I also noticed that if a file upload job fails, I don't
>>>>>> get the error message. What can I do to handle the error, and what is the
>>>>>> best practice for error handling with the file_upload method?
>>>>>>
>>>>>> Thanks!
>>>>>> Regards,
>>>>>> Siyuan
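For context, a hedged sketch of how the FILE_LOADS pieces discussed in this thread fit together on a streaming (unbounded) input, building on the write() transform sketched earlier; the triggering frequency and shard count are illustrative placeholders, not recommendations from the thread:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class StreamingFileLoadsExample {

  static BigQueryIO.Write<KV<String, Long>> streamingWrite() {
    return AvroFileLoadsExample.write()
        // Required for FILE_LOADS on an unbounded input: how often Beam closes the
        // current set of files and issues a BigQuery load job for them.
        .withTriggeringFrequency(Duration.standardMinutes(5))
        // Either pin the number of file shards written per trigger...
        .withNumFileShards(100)
        // ...or let the runner decide instead: .withAutoSharding()
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND);
  }
}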