large scheduler delay in pyspark

gen tang Mon, 03 Aug 2015 09:02:48 -0700

Hi,

Recently, I met some problems about scheduler delay in pyspark. I worked
several days on this problem, but not success. Therefore, I come to here to
ask for help.


I have a key_value pair rdd like rdd[(key, list[dict])] and I tried to
merge value by "adding" two list

if I do reduceByKey as follows:
   rdd.reduceByKey(lambda a, b: a+b)
It works fine, scheduler delay is less than 10s. However if I do
reduceByKey:
   def f(a, b):
       for i in b:
            if i not in a:
               a.append(i)
       return a
  rdd.reduceByKey(f)
It will cause very large scheduler delay, about 15-20 mins.(The data I deal
with is about 300 mb, and I use 5 machine with 32GB memory)

I know the second code is not the same as the first. In fact, my purpose is
to implement the second, but not work. So I try the first one.
I don't know whether this is related to the data(with long string) or Spark
on Yarn. But the first code works fine on the same data.

Is there any way to find out the log when spark stall in scheduler delay,
please? Or any ideas about this problem?

Thanks a lot in advance for your help.

Cheers
Gen

large scheduler delay in pyspark

Reply via email to