Spark. Efficiency. toDebugString understanding

Eugene Morozov Thu, 25 Jun 2015 17:14:00 -0700

Hi, I’m trying to understand toDebugString feature as how to get insights out 
of it. It seems a little confusing. Can anybody, please help me with 
understanding?


So, I have following example. First number is id, second is actual data, but 
for the sake of example it’s number of RDD.
RDD 1: (1, 1), (2, 1), (3, 1)
RDD 2: (3, 2), (4, 2), (5, 2)
RDD 3: (2, 3), (3, 3), (5, 3)

Number of RDD might be thought as a recency, so I have to get the most recent 
data in the result.
In the result I have to have: (1, 1), (2, 3), (3, 3), (4, 2), (5, 3).

It’s possible to achieve this by different ways. Actual code a little bit more 
complex, but the ideas are:
1: rdd1.union(rdd2).union(rdd3).reduceByKey(v2);
2. rdd1.fullOuterJoin(rdd2).reduceByKey(v2).fullOuterJoin(rdd3).reduceByKey(v2)
3. rdd1.cogroup(rdd2).reduceByKey(v2).cogroup(rdd3).reduceByKey(v2); //cogroup 
allows to group more, than one rdd, but it’s an example.

So, toDebugString look like:
union
(12) ShuffledRDD[8] at reduceByKey at DatasetsTest.java:67 []
 +-(12) UnionRDD[7] at union at DatasetsTest.java:67 []
    |   UnionRDD[6] at union at DatasetsTest.java:67 []
    |   MapPartitionsRDD[3] at mapToPair at DatasetsTest.java:63 []
    |   ParallelCollectionRDD[0] at parallelize at DatasetsTest.java:60 []
    |   MapPartitionsRDD[4] at mapToPair at DatasetsTest.java:64 []
    |   ParallelCollectionRDD[1] at parallelize at DatasetsTest.java:61 []
    |   MapPartitionsRDD[5] at mapToPair at DatasetsTest.java:65 []
    |   ParallelCollectionRDD[2] at parallelize at DatasetsTest.java:62 []

cogroup
(4) MapPartitionsRDD[16] at mapToPair at DatasetsTest.java:84 []
 |  MapPartitionsRDD[15] at cogroup at DatasetsTest.java:84 []
 |  MapPartitionsRDD[14] at cogroup at DatasetsTest.java:84 []
 |  CoGroupedRDD[13] at cogroup at DatasetsTest.java:84 []
 +-(4) MapPartitionsRDD[12] at mapToPair at DatasetsTest.java:84 []
 |  |  MapPartitionsRDD[11] at cogroup at DatasetsTest.java:84 []
 |  |  MapPartitionsRDD[10] at cogroup at DatasetsTest.java:84 []
 |  |  CoGroupedRDD[9] at cogroup at DatasetsTest.java:84 []
 |  +-(4) MapPartitionsRDD[3] at mapToPair at DatasetsTest.java:63 []
 |  |  |  ParallelCollectionRDD[0] at parallelize at DatasetsTest.java:60 []
 |  +-(4) MapPartitionsRDD[4] at mapToPair at DatasetsTest.java:64 []
 |     |  ParallelCollectionRDD[1] at parallelize at DatasetsTest.java:61 []
 +-(4) MapPartitionsRDD[5] at mapToPair at DatasetsTest.java:65 []
    |  ParallelCollectionRDD[2] at parallelize at DatasetsTest.java:62 []

join
(4) MapPartitionsRDD[18] at mapToPair at DatasetsTest.java:85 []
 |  MapPartitionsRDD[17] at fullOuterJoin at DatasetsTest.java:85 []
 |  MapPartitionsRDD[16] at fullOuterJoin at DatasetsTest.java:85 []
 |  MapPartitionsRDD[15] at fullOuterJoin at DatasetsTest.java:85 []
 |  CoGroupedRDD[14] at fullOuterJoin at DatasetsTest.java:85 []
 +-(4) MapPartitionsRDD[13] at mapToPair at DatasetsTest.java:85 []
 |  |  MapPartitionsRDD[12] at fullOuterJoin at DatasetsTest.java:85 []
 |  |  MapPartitionsRDD[11] at fullOuterJoin at DatasetsTest.java:85 []
 |  |  MapPartitionsRDD[10] at fullOuterJoin at DatasetsTest.java:85 []
 |  |  CoGroupedRDD[9] at fullOuterJoin at DatasetsTest.java:85 []
 |  +-(4) MapPartitionsRDD[3] at mapToPair at DatasetsTest.java:64 []
 |  |  |  ParallelCollectionRDD[0] at parallelize at DatasetsTest.java:61 []
 |  +-(4) MapPartitionsRDD[4] at mapToPair at DatasetsTest.java:65 []
 |     |  ParallelCollectionRDD[1] at parallelize at DatasetsTest.java:62 []
 +-(4) MapPartitionsRDD[5] at mapToPair at DatasetsTest.java:66 []
    |  ParallelCollectionRDD[2] at parallelize at DatasetsTest.java:63 []


The problem is I don’t really understand what will happen. My current 
understanding is that number in braces (12 and 4) are number of partitions to 
process. Different indentation means different stages. Different stage means 
different shuffle, which in turn means the data will be sent over the network. 
So, based on these I assume that there will be just one network hop for union 
and 4 network hops for both join and cogroup. 
Is my understanding correct?

If it is, then union is the fastest way of doing what I need, but it doesn’t 
guarantee ordering. So, it might make sense to transform RDDs into RDDs with 
some number first, then do union and that’ll still be faster, than other 
solutions.
Is my conclusion here correct?

One more question. Even though I do not change keys in fullOuterJoin and 
cogroup, it appers that nevertheless Spark does reshuffle. Why it does so?

Thanks in advance.

--
Eugene Morozov
fathers...@list.ru

Spark. Efficiency. toDebugString understanding

Reply via email to