It seems you want to dedupe your data after the merge, so set(a + b) should also work; you can ditch the list comprehension.

On 5 Aug 2015 23:55, "gen tang" <gen.tan...@gmail.com> wrote:
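One caveat with the set(a + b) idea: plain dicts are not hashable, so calling set() directly on a list of dicts raises TypeError. A minimal sketch of an order-preserving dedupe merge that works around this by hashing each dict's items (to_key and merge_dedupe are hypothetical helpers for illustration, not from the thread, and assume flat dicts with hashable values):

```python
def to_key(d):
    # Hypothetical helper: a hashable representation of a flat dict.
    # Assumes all values in the dict are themselves hashable.
    return frozenset(d.items())

def merge_dedupe(a, b):
    # O(len(a) + len(b)) merge: build a set of hashable keys once,
    # instead of an O(N) list-membership test per element.
    seen = set()
    out = []
    for d in a + b:
        k = to_key(d)
        if k not in seen:
            seen.add(k)
            out.append(d)
    return out

a = [{"x": 1}, {"x": 2}]
b = [{"x": 2}, {"x": 3}]
merged = merge_dedupe(a, b)  # [{"x": 1}, {"x": 2}, {"x": 3}]
```

In the PySpark context of the thread this would be used as rdd.reduceByKey(merge_dedupe); merge_dedupe is commutative up to ordering and associative on the deduped contents, which is what reduceByKey requires.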
> Hi,
> Thanks a lot for your reply.
>
> It seems that it is indeed because of the slowness of the second code.
> I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])),
> and the program returns to normal.
>
> By the way, I find that while the computation is running, the UI shows
> scheduler delay, but it is not actually scheduler delay. When the
> computation finishes, the UI shows the correct scheduler delay time.
>
> Cheers
> Gen
>
> On Tue, Aug 4, 2015 at 3:13 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> On Mon, Aug 3, 2015 at 9:00 AM, gen tang <gen.tan...@gmail.com> wrote:
>> > Hi,
>> >
>> > Recently, I met some problems with scheduler delay in PySpark. I worked
>> > on this problem for several days without success, so I have come here
>> > to ask for help.
>> >
>> > I have a key-value pair RDD like rdd[(key, list[dict])] and I tried to
>> > merge the values by "adding" the two lists.
>> >
>> > If I do reduceByKey as follows:
>> >     rdd.reduceByKey(lambda a, b: a + b)
>> > it works fine; scheduler delay is less than 10s. However, if I do
>> > reduceByKey like this:
>> >     def f(a, b):
>> >         for i in b:
>> >             if i not in a:
>> >                 a.append(i)
>> >         return a
>> >     rdd.reduceByKey(f)
>>
>> Is it possible that you have a large object that is also named `i`, `a`,
>> or `b`?
>>
>> Btw, the second one can be slower than the first, because you look up
>> an object in a list, which is O(N), especially when the objects are
>> large (dicts).
>>
>> > it causes very large scheduler delay, about 15-20 mins. (The data I
>> > deal with is about 300 MB, and I use 5 machines with 32 GB memory
>> > each.)
>>
>> If you see scheduler delay, it means there may be a large broadcast
>> involved.
>>
>> > I know the second code is not the same as the first. In fact, my
>> > purpose is to implement the second, but it does not work, so I tried
>> > the first one.
>> > I don't know whether this is related to the data (with long strings)
>> > or to Spark on YARN, but the first code works fine on the same data.
>> >
>> > Is there any way to find the logs for when Spark stalls in scheduler
>> > delay, please? Or any ideas about this problem?
>> >
>> > Thanks a lot in advance for your help.
>> >
>> > Cheers
>> > Gen
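Davies's point above, that membership testing in a list is O(N) while a set lookup is O(1) on average, can be checked with a quick standalone sketch (the sizes here are arbitrary, chosen only for illustration):

```python
import timeit

N = 10000
xs_list = list(range(N))
xs_set = set(xs_list)

# Membership test against a list scans elements one by one: O(N),
# and each comparison is itself expensive when the elements are dicts.
t_list = timeit.timeit(lambda: (N - 1) in xs_list, number=1000)

# Membership test against a set is a hash lookup: O(1) on average.
t_set = timeit.timeit(lambda: (N - 1) in xs_set, number=1000)
```

With a worst-case element (the last one in the list), the set lookup is typically orders of magnitude faster, which is why the f(a, b) combiner above degrades so badly as the merged lists grow inside reduceByKey.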