Hi Peer, If you open the driver UI (running on port 4040) you can see the stages and the tasks happening inside it. Best way to identify the bottleneck for a stage is to see if there's any time spending on GC, and how many tasks are there per stage (it should be a number > total # cores to achieve max parallelism). Also you can see for each task how long does it take etc into consideration.
Thanks Best Regards On Tue, May 19, 2015 at 12:58 PM, Peer, Oded <[email protected]> wrote: > I am running Spark over Cassandra to process a single table. > > My task reads a single days’ worth of data from the table and performs 50 > group by and distinct operations, counting distinct userIds by different > grouping keys. > > My code looks like this: > > > > JavaRdd<Row> rdd = sc.parallelize().mapPartitions().cache() // reads > the data from the table > > for each groupingKey { > > JavaPairRdd<GroupingKey, UserId> groupByRdd = rdd.mapToPair(); > > JavaPairRDD<GroupingKey, Long> countRdd = > groupByRdd.distinct().mapToPair().reduceByKey() // counts distinct values > per grouping key > > } > > > > The distinct() stage takes about 2 minutes for every groupByValue, and my > task takes well over an hour to complete. > > My cluster has 4 nodes and 30 GB of RAM per Spark process, the table size > is 4 GB. > > > > How can I identify the bottleneck more accurately? Is it caused by > shuffling data? > > How can I improve the performance? > > > > Thanks, > > Oded >
