Re: group by and distinct performance issue

Akhil Das Tue, 19 May 2015 00:56:24 -0700

Hi Peer,

If you open the driver UI (running on port 4040) you can see the stages and
the tasks happening inside it. Best way to identify the bottleneck for a
stage is to see if there's any time spending on GC, and how many tasks are
there per stage (it should be a number > total # cores to achieve max
parallelism). Also you can see for each task how long does it take etc into
consideration.


Thanks
Best Regards

On Tue, May 19, 2015 at 12:58 PM, Peer, Oded <[email protected]> wrote:

>  I am running Spark over Cassandra to process a single table.
>
> My task reads a single days’ worth of data from the table and performs 50
> group by and distinct operations, counting distinct userIds by different
> grouping keys.
>
> My code looks like this:
>
>
>
>    JavaRdd<Row> rdd = sc.parallelize().mapPartitions().cache() // reads
> the data from the table
>
>    for each groupingKey {
>
>       JavaPairRdd<GroupingKey, UserId> groupByRdd = rdd.mapToPair();
>
>       JavaPairRDD<GroupingKey, Long> countRdd =
> groupByRdd.distinct().mapToPair().reduceByKey() // counts distinct values
> per grouping key
>
>    }
>
>
>
> The distinct() stage takes about 2 minutes for every groupByValue, and my
> task takes well over an hour to complete.
>
> My cluster has 4 nodes and 30 GB of RAM per Spark process, the table size
> is 4 GB.
>
>
>
> How can I identify the bottleneck more accurately? Is it caused by
> shuffling data?
>
> How can I improve the performance?
>
>
>
> Thanks,
>
> Oded
>

Re: group by and distinct performance issue

Reply via email to