Hi all,

Thanks for the replies.

@Ahmed, you mentioned that one could hardcode another value
for DEFAULT_MAX_FILE_SIZE. How may I do that from my own code?
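
My naive guess is that this means building against a locally patched copy of
Beam with the constant in BatchLoads.java changed, something like the line
below (the declaration and the value here are my assumption, not copied from
the source):

// Hypothetical edit in a forked BatchLoads.java -- the exact declaration and
// default depend on the Beam version, so treat this as a sketch only:
static final long DEFAULT_MAX_FILE_SIZE = 8L * (1L << 40); // e.g. raise to 8 TiB

Is that what you meant, or is there a way to override it without forking the
SDK?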

@Reuven, to give you more context on my use case: I'm running into an issue
where a job that writes to BQ is taking an unexpectedly long time, and the
slowdown appears to be on the BQ load job side. My theory is that the
pipeline generates more load job requests than BQ can handle in a timely
manner, so I was thinking this could be mitigated by increasing the file
size and thereby reducing the number of load job requests.

That said, now that you've pointed to withMaxBytesPerPartition(), maybe
that's what I should use instead? I see it defaults to 11 TiB; perhaps I
could try increasing it to something closer to BQ's limit (15 TiB)?
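
To make sure I'm reading the API right, the change I have in mind would look
roughly like the sketch below. The table name, dispositions, and the exact
byte value are placeholders rather than my real config, and I haven't run
this yet:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class MaxBytesPerPartitionSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of(new TableRow().set("name", "example"))
            .withCoder(TableRowJsonCoder.of()))
        .apply(
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table") // placeholder table
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                // Raise the per-partition byte cap so the same input volume
                // is grouped into fewer load jobs. 14 TiB is a placeholder
                // value, not a recommendation.
                .withMaxBytesPerPartition(14L * (1L << 40))
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}

The idea being that a higher per-partition byte cap should turn the same
input volume into fewer load job requests.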

Thanks,

Julien

On Thu, Sep 29, 2022 at 11:01 AM Ahmed Abualsaud via user <
[email protected]> wrote:

> That's right, if maxFileSize is made too small you may hit the default
> maximum files per partition (10,000), in which case copy jobs will be
> triggered. With that said though, BigQueryIO already has a public
> withMaxBytesPerPartition() [1] method that controls the partition byte
> size, which is arguably more influential in triggering this other codepath.
>
> [1]
> https://github.com/apache/beam/blob/028c564b8ae1ba1ffa6aadb8212ec03555dd63b6/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2623
>
> On Thu, Sep 29, 2022 at 12:24 PM Reuven Lax <[email protected]> wrote:
>
>> It's not public because it was added for use in unit tests, and modifying
>> this value can have very unexpected results (e.g. making it smaller can
>> trigger a completely different codepath that is triggered when there are
>> too many files, leading to unexpected cost increases in the pipeline).
>>
>> Out of curiosity, what is your use case for needing to control this file
>> size?
>>
>> On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <
>> [email protected]> wrote:
>>
>>> Hey Julien,
>>>
>>> I don't see a problem with exposing that method. That part of the code
>>> was committed ~6 years ago, my guess is it wasn't requested to be public.
>>>
>>> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
>>> Would this work temporarily? @Chamikara Jayalath <[email protected]>
>>> @Reuven Lax <[email protected]> other thoughts?
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>>>
>>> Best,
>>> Ahmed
>>>
>>> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to control the size of files written to GCS when using
>>>> BigQueryIO's FILE_LOAD write method.
>>>>
>>>> However, it looks like the withMaxFileSize method (
>>>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>>>> is not public.
>>>>
>>>> Is that intentional? Is there a workaround to control the file size?
>>>>
>>>> Thanks,
>>>>
>>>> Julien
>>>>
>>>
