Thanks a lot, Akhil. After trying some suggestions in the tuning guide, there seems to be no improvement at all.
Below is the job detail when running locally (8 cores), which took 3 min to complete. We can see the map operation took most of the time; it looks like mapPartitions took too long. Is there any additional idea? Thanks a lot.

Proust

From: Akhil Das <ak...@sigmoidanalytics.com>
To: Proust GZ Feng/China/IBM@IBMCN
Cc: "user@spark.apache.org" <user@spark.apache.org>
Date: 06/15/2015 03:02 PM
Subject: Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

Have a look here: https://spark.apache.org/docs/latest/tuning.html

Thanks
Best Regards

On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng <pf...@cn.ibm.com> wrote:

Hi, Spark Experts

I have played with Spark for several weeks. After some testing, a reduce operation on a DataFrame takes 40s on a cluster with 5 datanode executors, and the backing data is only about 6,000 rows. Is this a normal case? Such performance looks too bad, because in Java a loop over 6,000 rows takes just a few seconds at most. Is there any document I should read to make the job much faster?

Thanks in advance
Proust
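(For reference, the comparison with a plain in-memory loop can be made concrete. A local reduction over 6,000 rows finishes in well under a second, so the 40s almost certainly goes to Spark overheads such as task scheduling, serialization, and shuffles rather than the per-row work itself. A minimal sketch in plain Python, using hypothetical integer rows as a stand-in for the actual DataFrame contents:)

```python
import time

# Hypothetical stand-in for the ~6,000 back-end rows from the thread.
rows = list(range(6000))

start = time.time()
total = 0
for r in rows:  # the equivalent of a simple local "reduce"
    total += r
elapsed = time.time() - start

print(total)          # 17997000
print(elapsed < 1.0)  # True: the loop itself is effectively instantaneous
```

(If a local loop is this fast, profiling should focus on where Spark spends the rest of the time, e.g. the number of partitions and tasks for such a small dataset, as the tuning guide above discusses.)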