STORAGE_API_AT_LEAST_ONCE only saves on the Dataflow engine cost, but the
Storage Write API cost alone is too high for us; that's why we want to
switch to file upload (FILE_LOADS).
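
For reference, this is roughly the FILE_LOADS configuration I'm
experimenting with. Untested sketch in Java; the table name, triggering
frequency, and shard count are placeholders, and rows is assumed to be a
PCollection<TableRow>:

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.joda.time.Duration;

    rows.apply("WriteToBQ",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder table
            // Switch from the Storage Write API to periodic batch load jobs.
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Required in streaming: bounds how long rows are buffered
            // (staged as files in the temp location) before a load job runs.
            .withTriggeringFrequency(Duration.standardMinutes(5))
            // Fix the shard count explicitly...
            .withNumFileShards(100)
            // ...or drop withNumFileShards and let the runner decide:
            // .withAutoSharding()
            .withCreateDisposition(
                BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(
                BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

My understanding (happy to be corrected) is that rows are staged as files
in the pipeline's temp location rather than held entirely in memory, so
the triggering frequency mainly trades staging latency against the
load-job quota.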

On Wed, Oct 2, 2024 at 12:08 PM XQ Hu via user <user@beam.apache.org> wrote:

> Have you checked
> https://cloud.google.com/dataflow/docs/guides/write-to-bigquery?
>
> Autosharding is generally recommended. If cost is the concern, have you
> checked STORAGE_API_AT_LEAST_ONCE?
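> For example (rough sketch; the table name is a placeholder):
>
>     BigQueryIO.writeTableRows()
>         .to("my-project:my_dataset.my_table")
>         .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
>
> Note that it relaxes exactly-once to at-least-once semantics, so check
> that duplicates are acceptable downstream.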
>
> On Wed, Oct 2, 2024 at 2:16 PM hsy...@gmail.com <hsy...@gmail.com> wrote:
>
>> We are trying to process over 150 TB of data per day (streaming,
>> unbounded) and save it to BQ, and it looks like the Storage Write API is
>> not economical enough for us. I tried to use file upload, but somehow it
>> doesn't work, and there is not much documentation on the file upload
>> method online. I have a few questions about the FILE_LOADS method in
>> streaming mode.
>> 1. How do I decide numFileShards? Can I still rely on autosharding?
>> 2. I noticed the FILE_LOADS method requires much more memory. I'm not
>> sure whether the Dataflow runner keeps all the data in memory before
>> writing it to files. If so, even one minute of data is too much to hold
>> in memory, and triggering more often than once a minute would exceed the
>> API quota. Is there a way to cap memory usage, e.g., write the data to
>> files continuously before triggering the load job?
>> 3. I also noticed that when a file load job fails, I don't get the error
>> message. What can I do to handle the error, and what is the best
>> practice for error handling with the FILE_LOADS method?
>>
>> Thanks!
>> Regards,
>> Siyuan
>>
>
