Hi, Recently, I met some problems about scheduler delay in pyspark. I worked several days on this problem, but not success. Therefore, I come to here to ask for help.
I have a key_value pair rdd like rdd[(key, list[dict])] and I tried to merge value by "adding" two list if I do reduceByKey as follows: rdd.reduceByKey(lambda a, b: a+b) It works fine, scheduler delay is less than 10s. However if I do reduceByKey: def f(a, b): for i in b: if i not in a: a.append(i) return a rdd.reduceByKey(f) It will cause very large scheduler delay, about 15-20 mins.(The data I deal with is about 300 mb, and I use 5 machine with 32GB memory) I know the second code is not the same as the first. In fact, my purpose is to implement the second, but not work. So I try the first one. I don't know whether this is related to the data(with long string) or Spark on Yarn. But the first code works fine on the same data. Is there any way to find out the log when spark stall in scheduler delay, please? Or any ideas about this problem? Thanks a lot in advance for your help. Cheers Gen