For runners that support Reshuffle, it should be safe to use. It's been
"deprecated" for 7 years, but it is still heavily used and is often the
recommended way to do things like this. I actually just opened a PR
<https://github.com/apache/beam/pull/30049> earlier today to undeprecate
it. It looks like you're using Dataflow, which has always supported
Reshuffle
<https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion>.

> Also I looked at the code, reshuffle seems doing some groupby work
> internally. But I don't really need groupby

GroupBy is basically an implementation detail that creates the desired
shuffling behavior in many runners (runners can also override the
transform implementation for primitives like this if needed, but that's
another can of worms). To prevent fusion, you need some operation between
the steps that forces a shuffle, and GroupBy is one option.
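
To make the "implementation detail" point concrete, here is a rough
plain-Python sketch of the idea: attach a throwaway random key, group by
that key, then drop the key again. This is not Beam's actual code; the
function names, the bucket count, and the seed are all made up for
illustration, and a real runner does the grouping across workers rather
than in one process.

```python
import random
from collections import defaultdict


def add_random_keys(elements, num_buckets, rng):
    """Pair each element with a throwaway key chosen at random."""
    return [(rng.randrange(num_buckets), element) for element in elements]


def group_by_key(keyed_elements):
    """Collect values per key. In a distributed runner, this step is the
    shuffle boundary that keeps the steps on either side from being fused."""
    groups = defaultdict(list)
    for key, value in keyed_elements:
        groups[key].append(value)
    return dict(groups)


def remove_keys(groups):
    """Drop the throwaway keys and flatten back to plain elements."""
    return [value for values in groups.values() for value in values]


def reshuffle_sketch(elements, num_buckets=4, seed=0):
    """Hypothetical helper: same elements out as in, possibly reordered."""
    rng = random.Random(seed)
    keyed = add_random_keys(elements, num_buckets, rng)
    return remove_keys(group_by_key(keyed))
```

The grouping never shows up in the output: you get the same elements
back, only redistributed. In the Python SDK you would not write any of
this yourself; you would put beam.Reshuffle() between the source and the
expensive ParDo.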

Given that you're using Dataflow, I'd also recommend checking out
https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion which
describes how to prevent fusion in more detail.

Thanks,
Danny

On Fri, Jan 19, 2024 at 12:36 PM [email protected] <[email protected]> wrote:

> Also I looked at the code, reshuffle seems doing some groupby work
> internally. But I don't really need groupby
>
> On Fri, Jan 19, 2024 at 9:35 AM [email protected] <[email protected]> wrote:
>
>> ReShuffle is deprecated
>>
>> On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user <[email protected]>
>> wrote:
>>
>>> I do not think it enforces a reshuffle by just checking the doc here:
>>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>>>
>>> Have you tried to just add ReShuffle after PubsubLiteIO?
>>>
>>> On Thu, Jan 18, 2024 at 8:54 PM [email protected] <[email protected]>
>>> wrote:
>>>
>>>> Hey guys,
>>>>
>>>> I have a question, does withkeys transformation enforce a reshuffle?
>>>>
>>>> My pipeline basically look like this PubsubLiteIO -> ParDo(..) ->
>>>> ParDo() -> BigqueryIO.write()
>>>>
>>>> The problem is PubsubLiteIO -> ParDo(..) -> ParDo() always fused
>>>> together. But The ParDo is expensive and I want dataflow to have more
>>>> workers to work on that, what's the best way to do that?
>>>>
>>>> Regards,
>>>>
>>>>
