Data skew is still a problem with Spark.

- If you use groupByKey, try to express your logic without groupByKey (for example with reduceByKey or aggregateByKey, which combine values on the map side before the shuffle).
- If you really need groupByKey, all you can do is scale vertically, since every value for a key must fit on one executor.
- If you can, repartition with a finer HashPartitioner. You will have more tasks per stage, but tasks are lightweight in Spark, so this should not introduce heavy overhead.
- If you have your own domain partitioner, try rewriting it by introducing a secondary key (salting), so that one hot key is spread over several partitions.
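A rough sketch of the points above, assuming an existing SparkContext `sc`; the data, the salt width of 16, and the partition count of 400 are made up for illustration:

```scala
import org.apache.spark.HashPartitioner
import scala.util.Random

// Toy pair RDD; "hot" stands in for a skewed key.
val pairs = sc.parallelize(Seq(("hot", 1L), ("hot", 1L), ("cold", 1L)))

// 1. Prefer reduceByKey over groupByKey: values are combined map-side,
//    so much less data per key is shuffled.
val counts = pairs.reduceByKey(_ + _)

// 2. Repartition with a finer HashPartitioner: many small tasks
//    instead of a few large ones.
val finer = pairs.reduceByKey(new HashPartitioner(400), _ + _)

// 3. Salt with a random secondary key and aggregate in two stages:
//    first per (key, salt) so the hot key is split across partitions,
//    then strip the salt and combine the partial results.
val salted = pairs
  .map { case (k, v) => ((k, Random.nextInt(16)), v) }
  .reduceByKey(_ + _)
  .map { case ((k, _), v) => (k, v) }
  .reduceByKey(_ + _)
```

The salting trick only helps for aggregations that can be computed in two passes (sums, counts, max, and the like); for a true group-by-key the values still end up together in the end.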
I hope I gave some insights and help.

On Fri, Aug 14, 2015 at 9:37 AM Jeff Zhang <zjf...@gmail.com> wrote:

> Data skew? Maybe your partition key has some special value like null or an
> empty string.
>
> On Fri, Aug 14, 2015 at 11:01 AM, randylu <randyl...@gmail.com> wrote:
>
>> It is strange that there are always two tasks slower than the others, and
>> the corresponding partitions' data are larger, no matter how many
>> partitions there are:
>>
>> Executor ID   Address                   Task Time   Tasks   Shuffle Read Size / Records
>> 1             slave129.vsvs.com:56691   16 s        1       99.5 MB / 18865432
>> *10           slave317.vsvs.com:59281   0 ms        0       413.5 MB / 311001318*
>> 100           slave290.vsvs.com:60241   19 s        1       110.8 MB / 27075926
>> 101           slave323.vsvs.com:36246   14 s        1       126.1 MB / 25052808
>>
>> The task time and record count of Executor 10 seem strange, and the CPUs
>> on that node are all 100% busy.
>>
>> Has anyone met the same problem? Thanks in advance for any answer!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Always-two-tasks-slower-than-others-and-then-job-fails-tp24257.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> --
> Best Regards
>
> Jeff Zhang