Hi, I have 6 sequence files as input to my Spark code. What I am doing is:
1. Create 6 individual RDDs, one per file.
2. Union them.
3. Apply some mapping.
4. Count the number of elements in the RDD.
5. Then sortByKey.
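For reference, the steps above can be sketched roughly as below (a minimal sketch, not my actual code; the paths, key/value types, and the mapping are placeholders):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Sketch of the pipeline described above; all names and types are hypothetical.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "PipelineSketch")

    // 1. Create one RDD per sequence file (6 files -> 6 RDDs).
    val paths = (1 to 6).map(i => s"/data/input/file" + i + ".seq")
    val rdds  = paths.map(p => sc.sequenceFile[String, String](p))

    // 2. Union them into a single RDD.
    val unioned = rdds.reduce(_ union _)

    // 3. Some mapping (placeholder transformation).
    val mapped = unioned.map { case (k, v) => (k, v.length) }

    // 4. Count the number of elements -- this triggers the first job.
    val total = mapped.count()

    // 5. Sort by key -- this triggers the later sortByKey job.
    val sorted = mapped.sortByKey()

    sc.stop()
  }
}
```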
Now, if I look at the logging, for the count step (step 4) it prints:

14/01/03 09:04:04 INFO scheduler.DAGScheduler: Got job 0 (count at PreBaseCubeCreator.scala:96) with 6 output partitions (allowLocal=false)

Doubt 1: Why 6 output partitions?

It then prints progress for each of them:

14/01/03 09:04:05 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager guavus-000392:52345 with 47.4 GB RAM
14/01/03 09:04:08 INFO cluster.ClusterTaskSetManager: Finished TID 5 in 3938 ms on guavus-000392 (progress: 1/6)
14/01/03 09:04:08 INFO scheduler.DAGScheduler: Completed ResultTask(0, 5)
14/01/03 09:04:09 INFO cluster.ClusterTaskSetManager: Finished TID 4 in 4211 ms on guavus-000392 (progress: 2/6)
14/01/03 09:04:09 INFO scheduler.DAGScheduler: Completed ResultTask(0, 4)
14/01/03 09:04:09 INFO cluster.ClusterTaskSetManager: Finished TID 1 in 4221 ms on guavus-000392 (progress: 3/6)
14/01/03 09:04:09 INFO scheduler.DAGScheduler: Completed ResultTask(0, 1)
14/01/03 09:04:10 INFO cluster.ClusterTaskSetManager: Finished TID 0 in 5581 ms on guavus-000392 (progress: 4/6)
14/01/03 09:04:10 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
14/01/03 09:04:12 INFO cluster.ClusterTaskSetManager: Finished TID 3 in 7830 ms on guavus-000392 (progress: 5/6)
14/01/03 09:04:12 INFO scheduler.DAGScheduler: Completed ResultTask(0, 3)
14/01/03 09:04:20 INFO cluster.ClusterTaskSetManager: Finished TID 2 in 15786 ms on guavus-000392 (progress: 6/6)
14/01/03 09:04:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 2)
14/01/03 09:04:20 INFO scheduler.DAGScheduler: Stage 0 (count at PreBaseCubeCreator.scala:96) finished in 16.320 s
14/01/03 09:04:20 INFO cluster.ClusterScheduler: Remove TaskSet 0.0 from pool
14/01/03 09:04:20 INFO spark.SparkContext: Job finished: count

After that, when it goes to sortByKey:

14/01/03 09:04:33 INFO scheduler.DAGScheduler: Got job 2 (sortByKey at PreBaseCubeCreator.scala:98) with 6 output partitions (allowLocal=false)

However, shouldn't it have been n output partitions, where n = the number of unique keys in the RDD?

Thanks and Regards,
Archit Thakur.
