For runners that support Reshuffle, it should be safe to use. It's been "deprecated" for 7 years, but it is still heavily used and is often the recommended way to do things like this. Earlier today I actually opened a PR <https://github.com/apache/beam/pull/30049> to undeprecate it. It looks like you're using Dataflow, which has always supported Reshuffle <https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion>.

> Also I looked at the code, reshuffle seems doing some groupby work internally. But I don't really need groupby

GroupBy is basically an implementation detail that creates the desired shuffling behavior in many runners (runners can also override transform implementations if needed for some primitives like this, but that's another can of worms). Basically, to prevent fusion you need some operation that forces a shuffle boundary, and GroupBy is one option. Given that you're using Dataflow, I'd also recommend checking out https://cloud.google.com/dataflow/docs/pipeline-lifecycle#prevent_fusion which describes how to do this in more detail.

Thanks,
Danny

On Fri, Jan 19, 2024 at 12:36 PM [email protected] <[email protected]> wrote:

> Also I looked at the code, reshuffle seems doing some groupby work
> internally. But I don't really need groupby
>
> On Fri, Jan 19, 2024 at 9:35 AM [email protected] <[email protected]> wrote:
>
>> ReShuffle is deprecated
>>
>> On Fri, Jan 19, 2024 at 8:25 AM XQ Hu via user <[email protected]>
>> wrote:
>>
>>> I do not think it enforces a reshuffle by just checking the doc here:
>>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html?highlight=withkeys#apache_beam.transforms.util.WithKeys
>>>
>>> Have you tried to just add ReShuffle after PubsubLiteIO?
>>>
>>> On Thu, Jan 18, 2024 at 8:54 PM [email protected] <[email protected]>
>>> wrote:
>>>
>>>> Hey guys,
>>>>
>>>> I have a question, does withkeys transformation enforce a reshuffle?
>>>>
>>>> My pipeline basically look like this PubsubLiteIO -> ParDo(..) ->
>>>> ParDo() -> BigqueryIO.write()
>>>>
>>>> The problem is PubsubLiteIO -> ParDo(..) -> ParDo() always fused
>>>> together. But The ParDo is expensive and I want dataflow to have more
>>>> workers to work on that, what's the best way to do that?
>>>>
>>>> Regards,
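
P.S. In case it helps to see why a GroupBy shows up inside Reshuffle even though you don't need grouping yourself, here is a rough plain-Python sketch of the idea. It is not Beam's actual code and it does not use Beam at all; `reshuffle_sketch`, `num_buckets`, and `seed` are names I made up for illustration. The point is only that the keys are throwaway: they are attached, used for one group-by, and then discarded, so your elements come out unchanged.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, TypeVar

T = TypeVar("T")


def reshuffle_sketch(elements: Iterable[T], num_buckets: int = 4, seed: int = 0) -> List[T]:
    """Conceptual stand-in for a reshuffle step (hypothetical, not Beam's implementation)."""
    rng = random.Random(seed)

    # 1. Pair every element with a throwaway random key.
    keyed = [(rng.randrange(num_buckets), element) for element in elements]

    # 2. Group by that key. In a real runner this grouping is the shuffle
    #    boundary that keeps the steps before and after it from being fused.
    groups: Dict[int, List[T]] = defaultdict(list)
    for key, element in keyed:
        groups[key].append(element)

    # 3. Drop the keys again and emit the original elements.
    return [element for key in sorted(groups) for element in groups[key]]


messages = ["msg-%d" % i for i in range(10)]
redistributed = reshuffle_sketch(messages)

# Same elements in and out; only their distribution (here, their order) may change.
print(sorted(redistributed) == sorted(messages))
```

In a real pipeline you would not write any of this yourself. You would apply the built-in transform between the read and the expensive ParDo, for example `beam.Reshuffle()` in the Python SDK or `Reshuffle.viaRandomKey()` in the Java SDK.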
