Hi,
I am running some experiments with OnlineLDAOptimizer in Spark 1.6.1. My
Spark cluster has 30 machines.
However, I found that the scheduler delay at the stage "reduce at
LDAOptimizer.scala:452" is extremely large when the LDA model is large; the
delay can be tens of seconds.
Does anyone know what causes this?
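For scale: in Hoffman-style online LDA, the per-iteration reduce aggregates a sufficient-statistics matrix on the order of k topics × vocabSize terms, which may explain why the delay grows with model size. A back-of-the-envelope sketch (the numbers here are hypothetical, not the poster's configuration):

```python
# Back-of-the-envelope: size of the k x vocabSize sufficient-statistics
# matrix that each online-LDA iteration reduces across the cluster
# (doubles, 8 bytes each). Numbers are illustrative, not from the thread.
k = 1000              # number of topics (hypothetical)
vocab_size = 100_000  # vocabulary size (hypothetical)
bytes_per_double = 8
matrix_mb = k * vocab_size * bytes_per_double / 1e6
# at this scale each aggregated task result is ~800 MB of statistics
```

If the model is anywhere near this size, time spent serializing and fetching those task results can show up as scheduler delay in the UI.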
each with 10 cores. The only application running
on the cluster is mine. The overall performance is quite good, but there is a
large scheduler delay, especially when it reads the smaller data frame from
HDFS and in the final step when it uses the broadcast hash join.
Usually compute time is just 50% of the total task time.
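One quick way to check whether the broadcast join itself is the culprit is to disable it and compare. A config sketch, assuming Spark 1.6's SQLContext; setting the threshold to -1 turns automatic broadcast joins off:

```python
# Config sketch, not a fix: turn off automatic broadcast hash joins and
# re-run to see whether the scheduler delay disappears. `sqlContext` is
# the application's existing SQLContext.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
```

If the delay moves or disappears with broadcast joins off, the time is likely being spent building and shipping the broadcast table.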
> > def f(a, b):
> >     for i in b:
> >         if i not in a:
> >             a.append(i)
> >     return a
> > rdd.reduceByKey(f)
>
> Is it possible that you have a large object that is also named `i`, `a`,
> or `b`?
>
> Btw, the second one could be slower than the first one, because you try to
> look up an object in a list, which is O(N), especially when the object is
> large (a dict).
> It will cause a very large scheduler delay, about 15-20 mins. (The data I
> deal with is about 300 MB, and I use 5 machines with 32 GB of memory.)
If you see scheduler delay, it means there may be a large broadcast involved.
> I know the second code is not equivalent to the first.
I use reduceByKey as follows:

rdd.reduceByKey(lambda a, b: a + b)

It works fine; the scheduler delay is less than 10s. However, if I do
reduceByKey like this:
def f(a, b):
    for i in b:
        if i not in a:
            a.append(i)
    return a

rdd.reduceByKey(f)
It will cause a very large scheduler delay, about 15-20 mins. (The data I
deal with is about 300 MB, and I use 5 machines with 32 GB of memory.)
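The membership test `i not in a` scans the whole list, so `f` degrades quadratically as values accumulate. A sketch of the same deduplicating merge using sets for average O(1) membership, assuming the values can be made hashable (ints here; dict values would first need a hashable key):

```python
# Sketch: the same deduplicating merge with sets instead of lists.

def f_list(a, b):
    # original list version: `i not in a` scans the list, O(len(a)) per check
    for i in b:
        if i not in a:
            a.append(i)
    return a

def f_set(a, b):
    # in-place set union: average O(len(b)) per merge
    a |= b
    return a

# The merge logic can be exercised without Spark:
parts = [{1, 2}, {2, 3}, {3, 4}]
merged = set()
for p in parts:
    merged = f_set(merged, p)
# merged is now {1, 2, 3, 4}
```

In PySpark this could be wired up as `rdd.mapValues(set).reduceByKey(f_set)`, assuming each value starts out as a list; that keeps the commutative/associative contract of reduceByKey while avoiding the linear scans.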