Data skew is still a problem with Spark.

- If you use groupByKey, try to express your logic without groupByKey;
reduceByKey or aggregateByKey combine values map-side before the shuffle, so
far less data moves for a hot key (see the first sketch after this list).
- If you really need groupByKey, about all you can do is scale vertically
(more memory per executor).
- If you can, repartition with a finer-grained HashPartitioner. You will have
more tasks per stage, but tasks are lightweight in Spark, so this should not
introduce heavy overhead. If you have your own domain partitioner, try
rewriting it to introduce a secondary key ("salting"), as in the second
sketch after this list.
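
For the groupByKey point, here is a minimal sketch, assuming a made-up
RDD[(String, Long)] called "pairs" and a simple sum as the per-key logic
(neither comes from your actual job):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("skew-demo"))
  // "pairs" stands in for your skewed pair RDD.
  val pairs = sc.parallelize(Seq(("hot", 1L), ("hot", 1L), ("cold", 1L)))
  // reduceByKey combines values map-side, so a hot key ships one partial
  // sum per partition across the shuffle instead of every record:
  val summed = pairs.reduceByKey(_ + _)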
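
For the partitioning point, a sketch of both options, continuing with the
hypothetical "pairs" RDD above (the partition count and salt factor are
made-up numbers you would tune):

  import org.apache.spark.HashPartitioner
  import scala.util.Random

  // Finer-grained hashing only separates keys that happened to collide;
  // a single hot key still lands in one partition:
  val finer = pairs.partitionBy(new HashPartitioner(2000))

  // Key salting: add a random secondary key so one hot key is spread over
  // `salt` partitions, aggregate, then drop the salt and aggregate the
  // (now small) partial results. This works for associative aggregations.
  val salt = 16
  val salted  = pairs.map { case (k, v) => ((k, Random.nextInt(salt)), v) }
  val partial = salted.reduceByKey(_ + _)
  val result  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)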
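
And to confirm the skew in the first place (including a null or empty key, as
Jeff suggests below), a quick check, again with the hypothetical "pairs" RDD:

  // Count records per key and print the ten heaviest keys; if one or two
  // dwarf the rest, that would explain the two persistently slow tasks.
  val topKeys = pairs
    .map { case (k, _) => (k, 1L) }
    .reduceByKey(_ + _)
    .top(10)(Ordering.by[(String, Long), Long](_._2))
  topKeys.foreach(println)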

I hope this gives some insight and helps.

On Fri, Aug 14, 2015 at 9:37 AM Jeff Zhang <zjf...@gmail.com> wrote:

> Data skew? Maybe your partition key has some special value, like null or an
> empty string.
>
> On Fri, Aug 14, 2015 at 11:01 AM, randylu <randyl...@gmail.com> wrote:
>
>>   It is strange that there are always two tasks slower than the others, and
>> that the corresponding partitions' data are larger, no matter how many
>> partitions I use.
>>
>>
>> Executor ID  Address                  Task Time  Total Tasks  Shuffle Read Size / Records
>> 1            slave129.vsvs.com:56691  16 s       1            99.5 MB / 18865432
>> *10          slave317.vsvs.com:59281  0 ms       0            413.5 MB / 311001318*
>> 100          slave290.vsvs.com:60241  19 s       1            110.8 MB / 27075926
>> 101          slave323.vsvs.com:36246  14 s       1            126.1 MB / 25052808
>>   The task time and record count for Executor 10 seem strange, and the CPUs
>> on that node are all 100% busy.
>>
>>   Has anyone met the same problem? Thanks in advance for any answer!
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
