Hi Fabian,

Merge join does not require your data to be sorted when executing on Beam:
https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_universal_transforms
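
To illustrate why a keyed join need not care about input order, here is a minimal conceptual sketch in plain Python. It is not the code of Hop's Beam handler or of Beam itself; the function name `grouped_inner_join` and the sample rows are made up for illustration. The idea is simply that rows are grouped by key first and only then paired, so no sort step is needed.

```python
from collections import defaultdict
from itertools import product


def grouped_inner_join(left, right, key):
    """Inner-join two unsorted lists of dict rows on `key` by grouping.

    Hypothetical helper for illustration only, not part of Hop or Beam.
    """
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)

    # Bucket every row by its join key; arrival order is irrelevant.
    for row in left:
        left_groups[row[key]].append(row)
    for row in right:
        right_groups[row[key]].append(row)

    joined = []
    for k, left_rows in left_groups.items():
        # Pair each left row with each right row that shares the key.
        for l_row, r_row in product(left_rows, right_groups.get(k, [])):
            joined.append({**l_row, **r_row})
    return joined


# Made-up sample data, deliberately not sorted on "id".
left = [{"id": 2, "a": "x"}, {"id": 1, "a": "y"}]
right = [{"id": 1, "b": "p"}, {"id": 2, "b": "q"}]
rows = grouped_inner_join(left, right, "id")
```

Sorting `rows` afterwards is only needed if the output order matters to you.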

Cheers,
Hans

On Fri, 2 Sep 2022 at 08:34, Fabian Peters <[email protected]> wrote:

> Good morning Matt,
>
> Thanks for your quick reply! Unfortunately the inputs are not sorted, so
> the Merge Join transform is not an option. I guess I'll have to use
> temporary BigQuery tables to handle this. Those pipelines are all bounded,
> so this is an option. Or is there an easy option to sort things when
> running on Beam?
>
> I'll create a Jira ticket, no problem.
>
> cheers
>
> Fabian
>
> On 01.09.2022 at 19:11, Matt Casters <[email protected]> wrote:
>
> Hi Fabian,
>
> Joining rows is indeed the exception in Beam. I would suggest you use the
> Merge Join
> <https://hop.apache.org/manual/latest/pipeline/transforms/mergejoin.html>
> transform.
> For unbounded pipelines (never ending) that transform will be handled
> <https://github.com/apache/hop/blob/master/plugins/engines/beam/src/main/java/org/apache/hop/beam/pipeline/handler/BeamMergeJoinTransformHandler.java>
> correctly.
> If you don't mind, please create a JIRA case so we can create a similar
> handler for the Cartesian product use-case.
> The code usually is non-trivial in the massively parallel world but quite
> doable ;-)
>
> All the best,
> Matt
>
>
> On Thu, Sep 1, 2022 at 6:37 PM Fabian Peters <[email protected]> wrote:
>
>> Hi all,
>>
>> I've hit the next problem, this time something I thought I had tested on
>> Beam before: A pipeline containing a "Join rows (cartesian product)"
>> transform with input from two sources loops forever when run via
>> Beam-Direct or Dataflow. It works fine using the local runner.
>>
>> While running it on Beam-Direct I've attached a debugger and can see that
>> it is stuck in the while loop at JoinRows.java:486
>> <https://github.com/apache/hop/blob/758c07c360c26c0447251f0a29df81557864ad11/plugins/transforms/joinrows/src/main/java/org/apache/hop/pipeline/transforms/joinrows/JoinRows.java#L487>.
>> I've tried using a GCS temp directory and swapped the "Main transform to
>> read from" but none of those helped.
>>
>> Is this transform incompatible with Beam? If so, what could I use instead?
>>
>> cheers
>>
>> Fabian
>>
>> <PastedGraphic-8.png>
>>
>
>
> --
> Neo4j Chief Solutions Architect
> *✉ *[email protected]
