Hi Peer,

If you open the driver UI (running on port 4040) you can see the stages and
the tasks happening inside it. Best way to identify the bottleneck for a
stage is to see if there's any time spending on GC, and how many tasks are
there per stage (it should be a number > total # cores to achieve max
parallelism). Also you can see for each task how long does it take etc into
consideration.

Thanks
Best Regards

On Tue, May 19, 2015 at 12:58 PM, Peer, Oded <[email protected]> wrote:

>  I am running Spark over Cassandra to process a single table.
>
> My task reads a single days’ worth of data from the table and performs 50
> group by and distinct operations, counting distinct userIds by different
> grouping keys.
>
> My code looks like this:
>
>
>
>    JavaRdd<Row> rdd = sc.parallelize().mapPartitions().cache() // reads
> the data from the table
>
>    for each groupingKey {
>
>       JavaPairRdd<GroupingKey, UserId> groupByRdd = rdd.mapToPair();
>
>       JavaPairRDD<GroupingKey, Long> countRdd =
> groupByRdd.distinct().mapToPair().reduceByKey() // counts distinct values
> per grouping key
>
>    }
>
>
>
> The distinct() stage takes about 2 minutes for every groupByValue, and my
> task takes well over an hour to complete.
>
> My cluster has 4 nodes and 30 GB of RAM per Spark process, the table size
> is 4 GB.
>
>
>
> How can I identify the bottleneck more accurately? Is it caused by
> shuffling data?
>
> How can I improve the performance?
>
>
>
> Thanks,
>
> Oded
>

Reply via email to