Yes, it is using the Dataflow runner. I'll give Avro a try.
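(For reference, a minimal sketch of what the Avro-based file-loads write discussed below might look like in the Beam Java SDK. MyEvent, tableSchema, and the toGenericRecord(...) helper are placeholders, not something defined in this thread.)

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    PCollection<MyEvent> events = ...;  // placeholder input

    events.apply(
        "WriteToBigQuery",
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // On an unbounded input, FILE_LOADS needs a triggering frequency
            // plus either a fixed shard count or auto-sharding.
            .withTriggeringFrequency(Duration.standardMinutes(10))
            .withAutoSharding()
            // Write Avro files instead of the default JSON files.
            .withAvroFormatFunction(
                (AvroWriteRequest<MyEvent> req) ->
                    toGenericRecord(req.getElement(), req.getSchema()))
            .useAvroLogicalTypes()
            .withSchema(tableSchema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));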
On Tue, Oct 8, 2024 at 11:02 AM Reuven Lax <re...@google.com> wrote:

I would try to use AVRO if possible - it tends to decrease the file size by quite a lot, and might get you under the limit for a single load job, which is 11 TB or 10,000 files (depending on the frequency at which you are triggering the loads). JSON tends to blow up the data size quite a bit.

BTW - is this using the Dataflow runner? If so, Beam should never delete the temp tables until the copy job is completed.

On Tue, Oct 8, 2024 at 10:49 AM hsy...@gmail.com <hsy...@gmail.com> wrote:

It is a streaming job.

On Tue, Oct 8, 2024 at 10:40 AM Reuven Lax <re...@google.com> wrote:

Is this a batch or streaming job?

On Tue, Oct 8, 2024 at 10:25 AM hsy...@gmail.com <hsy...@gmail.com> wrote:

It looks like the COPY job failed because the TEMP table was removed. @Reuven Lax <re...@google.com> Is that possible? Is there a way to avoid that? Or, even better, is there a way to force writing to the destination table directly? Thanks!

On Sun, Oct 6, 2024 at 12:35 PM Reuven Lax <re...@google.com> wrote:

By default the files are in JSON format. You can provide a formatter to write them in AVRO format instead, which will be more efficient.

The temp tables are only created if the file sizes are too large for a single load into BQ (if you use an AVRO formatter you might be able to reduce the file size enough to avoid this). In that case, Beam will issue a copy job to copy all the temp tables to the final table.

On Wed, Oct 2, 2024 at 2:42 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

@Reuven Lax <re...@google.com> I do see the file upload method create tons of temp tables, but when does BQ load the temp tables into the final table?

On Wed, Oct 2, 2024 at 1:17 PM Reuven Lax via user <user@beam.apache.org> wrote:

File loads do not return per-row errors (unlike the Storage API, which does). Dataflow will generally retry the entire file load on error (indefinitely for streaming, and up to 3 times for batch). You can look at the logs to find the specific error; however, it can be tricky to associate it with a specific row.

Reuven

On Wed, Oct 2, 2024 at 1:08 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

Any best practices for error handling for a file upload job?

On Wed, Oct 2, 2024 at 1:04 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

STORAGE_API_AT_LEAST_ONCE only saves Dataflow engine cost, but the Storage API cost alone is too high for us; that's why we want to switch to file upload.

On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org> wrote:

Have you checked https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?

Autosharding is generally recommended. If cost is the concern, have you checked STORAGE_API_AT_LEAST_ONCE?
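(For context, the STORAGE_API_AT_LEAST_ONCE option mentioned above is just a different Method value on the same BigQueryIO write. A minimal sketch, assuming a PCollection<TableRow> named rows and placeholder table and schema names; it avoids the exactly-once bookkeeping that drives up Dataflow cost, but BigQuery still bills the Storage Write API ingestion itself, which is the cost concern raised elsewhere in the thread.)

    rows.apply(
        "WriteAtLeastOnce",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            // At-least-once Storage Write API: lower Dataflow overhead than
            // STORAGE_WRITE_API, but rows may be duplicated on retries.
            .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
            .withSchema(tableSchema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));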
On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com> wrote:

We are trying to process over 150 TB of data per day (streaming, unbounded) and save it to BQ, and it looks like the Storage API is not economical enough for us. I tried to use file upload but somehow it doesn't work, and there is not much documentation on the file upload method online. I have a few questions regarding the file_upload method in streaming mode:

1. How do I decide numOfFileShards? Can I still rely on autosharding?

2. I noticed the file loads method requires much more memory. I'm not sure whether the Dataflow runner keeps all the data in memory before writing it to files? If so, even one minute of data is too much to keep in memory, and triggering more often than once a minute would exceed the API quota. Is there a way to cap the memory usage, e.g., write the data to files before triggering the file load job?

3. I also noticed that if a file upload job fails, I don't get the error message, so what can I do to handle the error? What is the best practice for error handling with the file_upload method?

Thanks!
Regards,
Siyuan
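(On questions 1 and 2, a small sketch of the triggering and sharding knobs on a streaming FILE_LOADS write, reusing the placeholder names from the Avro sketch near the top of the thread. The triggering frequency controls how often load jobs are issued; the shard count can either be pinned with withNumFileShards or left to the runner with withAutoSharding, the autosharding recommended in the Dataflow docs linked above.)

    BigQueryIO.Write<MyEvent> base =
        BigQueryIO.<MyEvent>write()
            .to("my-project:my_dataset.my_table")
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // How often load (and, if needed, temp-table copy) jobs are issued.
            .withTriggeringFrequency(Duration.standardMinutes(5))
            .withAvroFormatFunction(
                req -> toGenericRecord(req.getElement(), req.getSchema()))
            .withSchema(tableSchema);

    // Question 1, option A: pin the number of file shards explicitly.
    events.apply("WriteFixedShards", base.withNumFileShards(100));

    // Question 1, option B (instead of A): let the runner choose and adjust
    // the shard count dynamically.
    // events.apply("WriteAutoSharded", base.withAutoSharding());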