The default max file size is 4 TiB. BigQuery supports files up to 5 TiB, but
there might be some slop in our file-size estimation, which is why Beam sets
a slightly lower limit. In any case, you won't be able to increase that
value by much, or BigQuery will reject the load job.

The default max bytes per partition can probably be increased. When the
code was written, BigQuery's limit was 12 TiB; if it is now 15 TiB, that
would be a reason to raise it.
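
For reference, here is a rough sketch of how that knob is set on the sink.
The class name, table spec, dispositions, and the 12 TiB figure are
placeholders for illustration, not values from this thread:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    class MaxBytesPerPartitionSketch {
      // Sketch only: a FILE_LOADS write with a larger per-partition byte limit.
      static void write(PCollection<TableRow> rows) {
        rows.apply("WriteToBQ",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")        // placeholder table spec
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                // Public knob; the default is 11 TiB. Raising it means fewer,
                // larger load-job partitions.
                .withMaxBytesPerPartition(12L * (1L << 40))  // e.g. 12 TiB
                .withCreateDisposition(
                    BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(
                    BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }
    }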

BigQuery does not provide guarantees on scheduling load jobs (especially if
you don't have reserved slots). Some other ideas for how to improve things:
    - If you are running in streaming mode, consider increasing the
triggering frequency so you generate load jobs less often (there is a rough
sketch after this list).
    - By default, files are written out in JSON format. This is inefficient
and tends to create many more files. There is currently partial support for
writing files in the more efficient Avro format, but it requires you to call
withAvroWriter to pass in a function that converts your records into Avro.
    - I would also recommend trying the Storage Write API write method. It
does not have the same issues with scheduling that load jobs have.
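
To make the first and last suggestions concrete, here is a rough sketch.
The class/method names, table specs, durations, and shard/stream counts are
illustrative placeholders, not tuned recommendations:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    class WriteTuningSketch {
      // Streaming FILE_LOADS: trigger load jobs less often so each one is larger.
      static void fileLoadsLessOften(PCollection<TableRow> rows) {
        rows.apply("FileLoadsWrite",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")  // placeholder
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                // Fewer, bigger load jobs per trigger.
                .withTriggeringFrequency(Duration.standardMinutes(10))
                // A shard count (or auto-sharding) is needed alongside a
                // triggering frequency on unbounded input.
                .withNumFileShards(100)
                // For Avro output instead of JSON, see withAvroFormatFunction /
                // withAvroWriter (not shown here).
                .withCreateDisposition(
                    BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(
                    BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }

      // Alternative: the Storage Write API, which avoids load-job scheduling.
      static void storageWriteApi(PCollection<TableRow> rows) {
        rows.apply("StorageApiWrite",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")  // placeholder
                .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
                // In streaming, a triggering frequency and stream count are
                // typically configured as well.
                .withTriggeringFrequency(Duration.standardSeconds(30))
                .withNumStorageWriteApiStreams(10)
                .withCreateDisposition(
                    BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(
                    BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }
    }

Note that the triggering-frequency and shard/stream settings only apply to
unbounded (streaming) input; a batch pipeline would drop them.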

Reuven

On Thu, Sep 29, 2022 at 1:02 PM Julien Phalip <jpha...@gmail.com> wrote:

> Hi all,
>
> Thanks for the replies.
>
> @Ahmed, you mentioned that one could hardcode another value
> for DEFAULT_MAX_FILE_SIZE. How may I do that from my own code?
>
> @Reuven, to give you more context on my use case: I'm running into an
> issue where a job that writes to BQ is taking an unexpectedly long time. It
> looks like things are slowing down on the BQ load job side of things. My
> theory is that the pipeline might generate too many BQ load job requests
> for BQ to handle in a timely manner. So I was thinking that this could be
> mitigated by increasing the file size, and therefore reducing the number of
> load job requests.
>
> That said, now that you've pointed at withMaxBytesPerPartition(), maybe
> that's what I should use instead? I see this defaults to 11 TiB, but
> perhaps I could try increasing it to something closer to BQ's limit
> (15 TiB)?
>
> Thanks,
>
> Julien
>
> On Thu, Sep 29, 2022 at 11:01 AM Ahmed Abualsaud via user <
> user@beam.apache.org> wrote:
>
>> That's right, if maxFileSize is made too small you may hit the default
>> maximum number of files per partition (10,000), in which case copy jobs
>> will be triggered. That said, BigQueryIO already has a public
>> withMaxBytesPerPartition() [1] method that controls the partition byte
>> size, which is arguably more influential in triggering this other codepath.
>>
>> [1]
>> https://github.com/apache/beam/blob/028c564b8ae1ba1ffa6aadb8212ec03555dd63b6/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2623
>>
>> On Thu, Sep 29, 2022 at 12:24 PM Reuven Lax <re...@google.com> wrote:
>>
>>> It's not public because it was added for use in unit tests, and
>>> modifying this value can have very unexpected results (e.g. making it
>>> smaller can activate a completely different codepath, used when there
>>> are too many files, leading to unexpected cost increases in the
>>> pipeline).
>>>
>>> Out of curiosity, what is your use case for needing to control this file
>>> size?
>>>
>>> On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <
>>> ahmedabuals...@google.com> wrote:
>>>
>>>> Hey Julien,
>>>>
>>>> I don't see a problem with exposing that method. That part of the code
>>>> was committed ~6 years ago; my guess is it simply was never requested
>>>> to be made public.
>>>>
>>>> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
>>>> Would this work temporarily? @Chamikara Jayalath <chamik...@google.com>
>>>>  @Reuven Lax <re...@google.com> other thoughts?
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>>>>
>>>> Best,
>>>> Ahmed
>>>>
>>>> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <jpha...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to control the size of files written to GCS when using
>>>>> BigQueryIO's FILE_LOAD write method.
>>>>>
>>>>> However, it looks like the withMaxFileSize method (
>>>>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>>>>> is not public.
>>>>>
>>>>> Is that intentional? Is there a workaround to control the file size?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Julien
>>>>>
>>>>