This seems like a suboptimal situation for a join. How could Spark know in advance that every id in A is also present in B and that the two tables have the same number of rows? I suppose you could just sort the two frames by id and concatenate them, but I'm not sure what join optimization is available here.
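One thing worth noting: if the guarantee really holds, the join isn't needed at all, since an inner join that drops no rows and whose columns are never selected is a no-op. A rough sketch (the names a and b are hypothetical stand-ins for the frames in the question, and this assumes the standard Spark DataFrame API):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a and b stand in for the A and B frames in the question.
val spark = SparkSession.builder.master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

val a = Seq((1L, "x"), (2L, "y")).toDF("unique_id", "field1")
val b = Seq((1L, "p"), (2L, "q")).toDF("unique_id", "field2")

// Because every unique_id in A is also in B, the inner join removes no
// rows; selecting only A's columns means the join can be dropped by hand:
val result = a.select($"unique_id", $"field1")

// If that guarantee ever weakens, a left-semi join keeps the filtering
// semantics without materialising B's non-key columns:
val semi = a.join(b, Seq("unique_id"), "left_semi").select($"unique_id", $"field1")
```

Catalyst won't make that rewrite for you, because the id-containment invariant lives in your head, not in the schema.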
On Fri, Nov 29, 2019, 4:51 AM jelmer <jkupe...@gmail.com> wrote:

> I have 2 dataframes, let's call them A and B.
>
> A is made up of [unique_id, field1]
> B is made up of [unique_id, field2]
>
> They have the exact same number of rows, and every id in A is also present
> in B.
>
> If I execute a join like this: A.join(B,
> Seq("unique_id")).select($"unique_id", $"field1"), then Spark will do an
> expensive join even though it does not have to, because all the fields it
> needs are in A. Is there some trick I can use so that Catalyst will
> optimise this join away?