I was referring to the categorization made in the RDD paper. It describes a narrow dependency as one where each parent partition is required by at most one child partition (e.g. map), whereas in a wide dependency some parent partitions are required by multiple child partitions (e.g. a join without hash partitioning). The latter is significantly more expensive in the failure case, since all partitions of the parent RDD may have to be recomputed just to restore a single lost partition.
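To make the distinction concrete, here is a rough sketch of how one can check it from a spark-shell session (so `sc` is already defined). The helper joinAddsShuffle is just mine for illustration, not a Spark API; it walks the lineage of a joined RDD and reports whether the join itself introduced a shuffle:

    import org.apache.spark.{HashPartitioner, NarrowDependency}
    import org.apache.spark.rdd.RDD

    // Walk the lineage from `rdd` down to (but not into) the given input RDDs
    // and report whether the segment in between contains a shuffle, i.e. a
    // wide dependency. (Illustrative helper only, not a Spark API.)
    def joinAddsShuffle(rdd: RDD[_], inputs: Seq[RDD[_]]): Boolean =
      if (inputs.exists(_ eq rdd)) false
      else rdd.dependencies.exists {
        case narrow: NarrowDependency[_] => joinAddsShuffle(narrow.rdd, inputs)
        case _                           => true // a ShuffleDependency: wide
      }

    val a = sc.parallelize(1 to 100).map(i => (i, i))
    val b = sc.parallelize(1 to 100).map(i => (i, i * 2))

    // Without a common partitioner, the join has to shuffle both sides.
    joinAddsShuffle(a.join(b), Seq(a, b))        // true: wide

    // With both sides hash-partitioned the same way, the join is narrow:
    // each output partition depends on one partition of each parent.
    val p  = new HashPartitioner(8)
    val ap = a.partitionBy(p)
    val bp = b.partitionBy(p)
    joinAddsShuffle(ap.join(bp), Seq(ap, bp))    // false: narrow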
Here is a screenshot of the relevant diagram from the paper :D
http://cl.ly/image/3w2b1s2k2H1g

I’m currently building a Datalog system on top of Spark, which means hundreds of join iterations, so having fast recovery would be very nice^^.

Thanks :),
Jan

On 03.04.2014, at 22:10, Michael Armbrust <[email protected]> wrote:

> I'm sorry, but I don't really understand what you mean when you say "wide" in
> this context. For a HashJoin, the only dependencies of the produced RDD are
> the two input RDDs. For BroadcastNestedLoopJoin, the only dependency will be
> on the streamed RDD. The other RDD will be distributed to all nodes using a
> broadcast variable.
>
> Michael
>
>
> On Thu, Apr 3, 2014 at 12:59 PM, Jan-Paul Bultmann <[email protected]> wrote:
> Hey,
> Does somebody know the kinds of dependencies that the new SQL operators produce?
> I’m specifically interested in the relational join operation, as it seems substantially more optimized.
>
> The old join was narrow on two RDDs with the same partitioner.
> Is the relational join narrow as well?
>
> Cheers, Jan
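
PS: The broadcast pattern Michael mentions can also be imitated by hand in the RDD API. This is only a rough sketch (again assuming a spark-shell `sc`), not Spark SQL's actual BroadcastNestedLoopJoin operator, but it shows why the result depends only on the streamed side:

    // Ship the small side to every executor as a broadcast variable ...
    val small = sc.parallelize(Seq((1, "a"), (2, "b")))
    val large = sc.parallelize(1 to 1000000).map(i => (i % 10, i))
    val lookup = sc.broadcast(small.collectAsMap())

    // ... and join map-side. The joined RDD has only a narrow dependency on
    // `large`; losing a partition re-runs just that partition's flatMap.
    val joined = large.flatMap { case (k, v) =>
      lookup.value.get(k).map(w => (k, (v, w))).toList
    }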
