That's right: if maxFileSize is made too small, you may hit the default
maximum number of files per partition (10,000), in which case copy jobs will
be triggered. That said, BigQueryIO already has a public
withMaxBytesPerPartition() [1] method that controls the partition byte
size, which is arguably more influential in triggering this other codepath.

[1]
https://github.com/apache/beam/blob/028c564b8ae1ba1ffa6aadb8212ec03555dd63b6/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2623
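
To make the interaction between the two limits concrete, here is a rough
back-of-the-envelope sketch. It is not Beam's actual partitioning code: the
class and method names are hypothetical, it assumes every file is filled up to
maxFileSize (real files are often smaller), and the 11 TiB partition size used
in main() is an assumed value that should be checked against BatchLoads. The
only number taken from this thread is the 10,000 files-per-partition default.

```java
// Hypothetical helper, not part of Beam: estimates how many load partitions a
// FILE_LOADS write would need under a simplified model of the two limits.
class PartitionEstimate {
    // Default maximum number of files per partition, as mentioned in this thread.
    static final long MAX_FILES_PER_PARTITION = 10_000L;

    // Ceiling division for positive longs.
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    // Number of files, assuming each file is filled up to maxFileSize.
    static long filesNeeded(long totalBytes, long maxFileSize) {
        return ceilDiv(totalBytes, maxFileSize);
    }

    // Partitions needed so that neither the file-count limit nor the
    // byte-size limit is exceeded in any partition.
    static long partitionsNeeded(long totalBytes, long maxFileSize, long maxBytesPerPartition) {
        long byFileCount = ceilDiv(filesNeeded(totalBytes, maxFileSize), MAX_FILES_PER_PARTITION);
        long byBytes = ceilDiv(totalBytes, maxBytesPerPartition);
        return Math.max(1L, Math.max(byFileCount, byBytes));
    }

    // In this model, more than one partition means the copy-job codepath is used.
    static boolean triggersCopyJobs(long totalBytes, long maxFileSize, long maxBytesPerPartition) {
        return partitionsNeeded(totalBytes, maxFileSize, maxBytesPerPartition) > 1;
    }

    public static void main(String[] args) {
        long tib = 1L << 40;
        long assumedPartitionBytes = 11 * tib; // assumption, not confirmed in this thread

        // 1 TiB of data with 64 MiB files vs. 1 GiB files.
        System.out.println(triggersCopyJobs(tib, 64L << 20, assumedPartitionBytes));
        System.out.println(triggersCopyJobs(tib, 1L << 30, assumedPartitionBytes));
        // Same data and 1 GiB files, but a much smaller partition byte size.
        System.out.println(partitionsNeeded(tib, 1L << 30, 256L << 30));
    }
}
```

In this model, shrinking the file size only matters once the file count passes
10,000, whereas lowering the partition byte size splits the load directly,
which is the sense in which withMaxBytesPerPartition() is the more direct lever.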

On Thu, Sep 29, 2022 at 12:24 PM Reuven Lax <[email protected]> wrote:

> It's not public because it was added for use in unit tests, and modifying
> this value can have very unexpected results (e.g. making it smaller can
> switch the pipeline to a completely different codepath that is used when
> there are too many files, leading to unexpected cost increases).
>
> Out of curiosity, what is your use case for needing to control this file
> size?
>
> On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <[email protected]>
> wrote:
>
>> Hey Julien,
>>
>> I don't see a problem with exposing that method. That part of the code
>> was committed ~6 years ago; my guess is that nobody requested it be public.
>>
>> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
>> Would this work temporarily?
>> @Chamikara Jayalath <[email protected]> @Reuven Lax <[email protected]>
>> other thoughts?
>>
>> [1]
>> https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>>
>> Best,
>> Ahmed
>>
>> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'd like to control the size of files written to GCS when using
>>> BigQueryIO's FILE_LOAD write method.
>>>
>>> However, it looks like the withMaxFileSize method (
>>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>>> is not public.
>>>
>>> Is that intentional? Is there a workaround to control the file size?
>>>
>>> Thanks,
>>>
>>> Julien
>>>
>>
