The RDDs themselves are not materialized, but the implementations can materialize.
E.g. in cogroup (which is used by RDD.join), it materializes all the data during grouping. In SQL/DataFrame join, depending on the join: 1. For broadcast join, only the smaller side is materialized in memory as a hash table. 2. For sort-merge join, both sides are sorted & streamed through -- however, one of the sides need to buffer all the rows having the same join key in order to perform the join. On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <rishi80.mis...@gmail.com > wrote: > Hi Reynold, > Can you please elaborate on this. I thought RDD also opens only an > iterator. Does it get materialized for joins? > > Rishi > > On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com> wrote: > >> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side >> streams. >> >> >> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com> >> wrote: >> >>> in scalding we join with the smaller side on the left, since the smaller >>> side will get buffered while the bigger side streams through the join. >>> >>> looking at CoGroupedRDD i do not get the impression such a distiction is >>> made. it seems both sided are put into a map that can spill to disk. is >>> this correct? >>> >>> thanks >>> >> >>