Hi Fabian,

The Merge Join transform does not require your data to be sorted when executing on Beam:
https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_universal_transforms
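
As an illustration only (plain Python, not Hop's or Beam's actual handler code): a join that groups rows by key, rather than walking two sorted streams in step, gives the same result whatever order the inputs arrive in. The function name, the sample rows, and the field names below are made up for this sketch.

```python
from collections import defaultdict


def hash_inner_join(left, right, key):
    """Inner-join two lists of dict rows on `key` without sorting either input."""
    # Group the right-hand rows by join key; arrival order does not matter.
    by_key = defaultdict(list)
    for row in right:
        by_key[row[key]].append(row)

    # For each left-hand row, emit one merged row per matching right-hand row.
    joined = []
    for l_row in left:
        for r_row in by_key.get(l_row[key], []):
            joined.append({**r_row, **l_row})
    return joined


# Hypothetical, deliberately unsorted inputs.
orders = [{"id": 3, "item": "c"}, {"id": 1, "item": "a"}, {"id": 2, "item": "b"}]
customers = [{"id": 2, "name": "Bo"}, {"id": 3, "name": "Cy"}, {"id": 1, "name": "Al"}]

result = hash_inner_join(orders, customers, "id")
```

Every order is matched to its customer even though neither list is sorted on `id`, which is why no sort step should be needed in front of the join in this setting.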

Cheers,
Hans

On Fri, 2 Sep 2022 at 08:34, Fabian Peters <[email protected]> wrote:

> Good morning Matt,
>
> Thanks for your quick reply! Unfortunately the inputs are not sorted, so
> the Merge Join transform is not an option. I guess I'll have to use
> temporary BigQuery tables to handle this. Those pipelines are all bounded,
> so this is an option. Or is there an easy option to sort things when
> running on Beam?
>
> I'll create a Jira ticket, no problem.
>
> cheers
>
> Fabian
>
> On 01.09.2022 at 19:11, Matt Casters <[email protected]> wrote:
>
> Hi Fabian,
>
> Joining rows is indeed the exception in Beam. I would suggest you use the
> Merge Join
> <https://hop.apache.org/manual/latest/pipeline/transforms/mergejoin.html>
> transform.
> For unbounded pipelines (never ending) that transform will be handled
> <https://github.com/apache/hop/blob/master/plugins/engines/beam/src/main/java/org/apache/hop/beam/pipeline/handler/BeamMergeJoinTransformHandler.java>
> correctly.
> If you don't mind, please create a JIRA case so we can create a similar
> handler for the Cartesian product use-case.
> The code is usually non-trivial in the massively parallel world but quite
> doable ;-)
>
> All the best,
> Matt
>
>
> On Thu, Sep 1, 2022 at 6:37 PM Fabian Peters <[email protected]> wrote:
>
>> Hi all,
>>
>> I've hit the next problem, this time something I thought I had tested on
>> Beam before: a pipeline containing a "Join rows (cartesian product)"
>> transform with input from two sources loops forever when run via
>> Beam-Direct or Dataflow. It works fine using the local runner.
>>
>> While running it on Beam-Direct I've attached a debugger and can see that
>> it is stuck in the while loop at JoinRows.java:486
>> <https://github.com/apache/hop/blob/758c07c360c26c0447251f0a29df81557864ad11/plugins/transforms/joinrows/src/main/java/org/apache/hop/pipeline/transforms/joinrows/JoinRows.java#L487>.
>> I've tried using a GCS temp directory and swapped the "Main transform to
>> read from" but none of those helped.
>>
>> Is this transform incompatible with Beam? If so, what could I use instead?
>>
>> cheers
>>
>> Fabian
>>
>> <PastedGraphic-8.png>
>>
>
>
> --
> Neo4j Chief Solutions Architect
> *✉   *[email protected]
>
>
>
>
>
