You can implement a custom partitioner to do the bucketing.

On Sun, Jun 30, 2019 at 5:15 AM Chris Teoh <chris.t...@gmail.com> wrote:
> The closest thing I can think of here is if you have both dataframes
> written out using buckets. Hive uses this technique for join
> optimisation, such that both datasets of the same bucket are read by
> the same mapper to achieve map-side joins.
>
> On Sat., 29 Jun. 2019, 9:10 pm jelmer, <jkupe...@gmail.com> wrote:
>
>> I have 2 dataframes:
>>
>> Dataframe A, which contains 1 element per partition that is gigabytes
>> big (an index).
>>
>> Dataframe B, which is made up of millions of small rows.
>>
>> I want to join B on A, but I want all the work to be done on the
>> executors holding the partitions of dataframe A.
>>
>> Is there a way to accomplish this without putting dataframe B in a
>> broadcast variable or doing a broadcast join?
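To make the bucketing idea concrete, here is a minimal plain-Python sketch (no Spark required) of what co-located bucketed joining does under the hood. All function names here are hypothetical illustrations, not Spark APIs: both sides hash the join key with the same function and bucket count, so matching rows always land in the same bucket, and the join can run bucket-by-bucket with the large side never moving.

```python
from collections import defaultdict

def bucket_of(key, num_buckets):
    # Deterministic bucket assignment. Both datasets must use the SAME
    # hash function and bucket count, mirroring Hive/Spark bucketing.
    return hash(key) % num_buckets

def bucketize(rows, key_fn, num_buckets):
    # Group rows by bucket id; in Spark this grouping is the physical
    # partitioning of the written-out bucketed table.
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(key_fn(row), num_buckets)].append(row)
    return buckets

def bucketed_join(big, small, big_key, small_key, num_buckets):
    # Join bucket-by-bucket: only the small side's rows for a given
    # bucket are shipped to it; the big side stays where it was written.
    big_buckets = bucketize(big, big_key, num_buckets)
    small_buckets = bucketize(small, small_key, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build an in-bucket lookup from the big side, then probe with
        # the small side's rows for the same bucket (a map-side join).
        index = {big_key(r): r for r in big_buckets.get(b, [])}
        for r in small_buckets.get(b, []):
            match = index.get(small_key(r))
            if match is not None:
                joined.append((match, r))
    return joined
```

In Spark itself, writing both dataframes with `df.write.bucketBy(n, "key").sortBy("key").saveAsTable(...)` (same `n` and key on both sides) lets the planner read matching buckets together and avoid a shuffle for the join, which is the technique described above.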