That's right: if maxFileSize is made too small, you may hit the default maximum number of files per partition (10,000), in which case copy jobs will be triggered. That said, BigQueryIO already has a public withMaxBytesPerPartition() [1] method that controls the partition byte size, which is arguably more influential in triggering this other codepath.
[1] https://github.com/apache/beam/blob/028c564b8ae1ba1ffa6aadb8212ec03555dd63b6/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2623

On Thu, Sep 29, 2022 at 12:24 PM Reuven Lax <[email protected]> wrote:

> It's not public because it was added for use in unit tests, and modifying
> this value can have very unexpected results (e.g. making it smaller can
> trigger a completely different codepath that is taken when there are too
> many files, leading to unexpected cost increases in the pipeline).
>
> Out of curiosity, what is your use case for needing to control this file
> size?
>
> On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <[email protected]> wrote:
>
>> Hey Julien,
>>
>> I don't see a problem with exposing that method. That part of the code
>> was committed ~6 years ago; my guess is it simply was never requested to
>> be made public.
>>
>> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
>> Would this work temporarily? @Chamikara Jayalath <[email protected]>
>> @Reuven Lax <[email protected]> other thoughts?
>>
>> [1] https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>>
>> Best,
>> Ahmed
>>
>> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'd like to control the size of files written to GCS when using
>>> BigQueryIO's FILE_LOADS write method.
>>>
>>> However, it looks like the withMaxFileSize method (
>>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>>> is not public.
>>>
>>> Is that intentional? Is there a workaround to control the file size?
>>>
>>> Thanks,
>>>
>>> Julien
>>>
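To make the interaction at the top of the thread concrete, here is a rough back-of-the-envelope sketch in plain Java. Only the 10,000 files-per-partition default comes from the thread; the helper methods, names, and byte figures below are hypothetical illustrations, not Beam code:

```java
// Hypothetical model of why shrinking maxFileSize can flip BigQueryIO's
// FILE_LOADS path onto the multi-partition codepath that issues copy jobs:
// smaller files mean more files, and once a single load partition would need
// more than the maximum number of files, the fallback kicks in.
public class LoadPartitionMath {

    // Default maximum files per partition, per the discussion in the thread.
    static final long DEFAULT_MAX_FILES_PER_PARTITION = 10_000L;

    // Number of load files needed for totalBytes at a given max file size
    // (ceiling division).
    static long filesNeeded(long totalBytes, long maxFileSizeBytes) {
        return (totalBytes + maxFileSizeBytes - 1) / maxFileSizeBytes;
    }

    // True when the file count exceeds the per-partition limit, i.e. when
    // the copy-job codepath would be triggered in this simplified model.
    static boolean triggersCopyJobs(long totalBytes, long maxFileSizeBytes) {
        return filesNeeded(totalBytes, maxFileSizeBytes) > DEFAULT_MAX_FILES_PER_PARTITION;
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        // 10 GiB capped at 1 MiB per file needs 10,240 files: over the limit.
        System.out.println(filesNeeded(tenGiB, 1024 * 1024));
        System.out.println(triggersCopyJobs(tenGiB, 1024 * 1024));
        // The same data at 2 MiB per file needs 5,120 files: under the limit.
        System.out.println(triggersCopyJobs(tenGiB, 2L * 1024 * 1024));
    }
}
```

This is also why withMaxBytesPerPartition() matters here: it bounds the partition size directly, whereas the file size only influences the file count indirectly.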
