This seems like a suboptimal situation for a join. How could Spark know in advance that every id in A is also present in B and that the two tables have the same number of rows? I suppose you could just sort the two frames by id and concatenate them, but I'm not sure what join optimization is available here.
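One thing worth noting: if the guarantee really holds, the join isn't needed at all, since an inner join that drops no rows and whose columns are never selected is a no-op. A rough sketch (the names a and b are hypothetical stand-ins for the frames in the question, and this assumes the standard Spark DataFrame API):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a and b stand in for the A and B frames in the question.
val spark = SparkSession.builder.master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

val a = Seq((1L, "x"), (2L, "y")).toDF("unique_id", "field1")
val b = Seq((1L, "p"), (2L, "q")).toDF("unique_id", "field2")

// Because every unique_id in A is also in B, the inner join removes no
// rows; selecting only A's columns means the join can be dropped by hand:
val result = a.select($"unique_id", $"field1")

// If that guarantee ever weakens, a left-semi join keeps the filtering
// semantics without materialising B's non-key columns:
val semi = a.join(b, Seq("unique_id"), "left_semi").select($"unique_id", $"field1")
```

Catalyst won't make that rewrite for you, because the id-containment invariant lives in your head, not in the schema.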
On Fri, Nov 29, 2019, 4:51 AM jelmer <jkupe...@gmail.com> wrote:

> I have 2 dataframes, let's call them A and B.
>
> A is made up of [unique_id, field1]
> B is made up of [unique_id, field2]
>
> They have the exact same number of rows, and every id in A is also present
> in B.
>
> If I execute a join like this: A.join(B,
> Seq("unique_id")).select($"unique_id", $"field1"), then Spark will do an
> expensive join even though it does not have to, because all the fields it
> needs are in A. Is there some trick I can use so that Catalyst will
> optimise this join away?