The RDDs themselves are not materialized, but the implementations can
materialize.

E.g. in cogroup (which is used by RDD.join), it materializes all the data
during grouping.

In SQL/DataFrame join, depending on the join:

1. For broadcast join, only the smaller side is materialized in memory as a
hash table.

2. For sort-merge join, both sides are sorted & streamed through --
however, one of the sides need to buffer all the rows having the same join
key in order to perform the join.



On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <rishi80.mis...@gmail.com
> wrote:

> Hi Reynold,
> Can you please elaborate on this. I thought RDD also opens only an
> iterator. Does it get materialized for joins?
>
> Rishi
>
> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com> wrote:
>
>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>> streams.
>>
>>
>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> in scalding we join with the smaller side on the left, since the smaller
>>> side will get buffered while the bigger side streams through the join.
>>>
>>> looking at CoGroupedRDD i do not get the impression such a distiction is
>>> made. it seems both sided are put into a map that can spill to disk. is
>>> this correct?
>>>
>>> thanks
>>>
>>
>>

Reply via email to