You can implement a custom partitioner to do the bucketing.
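As a minimal sketch of the bucketing logic such a partitioner would use (plain Python, illustrative only — in Spark itself you would subclass org.apache.spark.Partitioner and override numPartitions and getPartition, or pass a partition function to RDD.partitionBy in PySpark):

```python
# Sketch of the bucketing rule a custom partitioner would apply:
# a deterministic hash of the join key picks the bucket, so the same
# key lands in the same partition regardless of which dataset it came
# from. Bucket count is a hypothetical choice.
import zlib

NUM_BUCKETS = 8  # hypothetical number of buckets

def bucket_for(key: str) -> int:
    # zlib.crc32 is stable across processes, unlike Python's builtin
    # hash() under hash randomisation, so both sides agree on buckets.
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

# Same key, same bucket — whichever dataframe the row belongs to:
assert bucket_for("user42") == bucket_for("user42")
assert 0 <= bucket_for("user42") < NUM_BUCKETS
```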

On Sun, Jun 30, 2019 at 5:15 AM Chris Teoh <chris.t...@gmail.com> wrote:

> The closest thing I can think of here is writing both dataframes out
> using buckets. Hive uses this technique for join optimisation: both
> datasets' rows for the same bucket are read by the same mapper,
> achieving a map-side join.
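The bucketed map-side join described above can be simulated in plain Python (names, bucket count, and hash choice are illustrative; in Spark the analogous mechanism is writing both sides with DataFrameWriter.bucketBy on the join key and joining the saved tables):

```python
# Simulation of a bucketed (map-side) join: both datasets are
# pre-partitioned into buckets by the same hash of the join key, so each
# "mapper" joins one bucket of A with the matching bucket of B and no
# data moves across buckets. All names here are illustrative.
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical

def bucket_of(key: str) -> int:
    # Stable across processes, unlike Python's randomised builtin hash.
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

def bucketize(rows, key_fn):
    # Route every row to its bucket — the "write out using buckets" step.
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(key_fn(row))].append(row)
    return buckets

def bucketed_join(a_rows, b_rows, a_key, b_key):
    a_buckets = bucketize(a_rows, a_key)
    b_buckets = bucketize(b_rows, b_key)
    out = []
    # Each bucket is joined independently, like one mapper per bucket.
    for b_id in range(NUM_BUCKETS):
        index = {a_key(r): r for r in a_buckets.get(b_id, [])}
        for row in b_buckets.get(b_id, []):
            if b_key(row) in index:
                out.append((index[b_key(row)], row))
    return out

a = [("u1", "index-blob-1"), ("u2", "index-blob-2")]
b = [("u1", 10), ("u1", 11), ("u3", 99)]
pairs = bucketed_join(a, b, a_key=lambda r: r[0], b_key=lambda r: r[0])
# "u1" matches twice; "u3" has no counterpart in A.
```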
>
> On Sat., 29 Jun. 2019, 9:10 pm jelmer, <jkupe...@gmail.com> wrote:
>
>> I have 2 dataframes,
>>
>> Dataframe A, which contains one element per partition, each gigabytes
>> in size (an index).
>>
>> Dataframe B, which is made up of millions of small rows.
>>
>> I want to join B on A, but I want all the work to be done on the
>> executors holding the partitions of dataframe A.
>>
>> Is there a way to accomplish this without putting dataframe B in a
>> broadcast variable or doing a broadcast join?
>>
>>
