large scheduler delay in OnlineLDAOptimizer (MLlib and LDA)

2016-10-27 Thread Xiaoye Sun
Hi, I am running some experiments with OnlineLDAOptimizer in Spark 1.6.1. My Spark cluster has 30 machines. However, I found that the scheduler delay at the job/stage "reduce at LDAOptimizer.scala:452" is extremely large when the LDA model is large. The delay can be tens of seconds. Does anyone know…
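A minimal sketch of the kind of setup being described, assuming the MLlib RDD API in Spark 1.6 (the corpus and parameter values here are made up, not from the thread). The stage at LDAOptimizer.scala:452 reduces per-partition sufficient statistics whose size grows roughly with k x vocabSize, so a larger model means more data serialized and shipped per task:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import LDA
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="online-lda-sketch")

    # Toy corpus of (docId, term-count vector); vocabSize is the vector length.
    corpus = sc.parallelize([
        (0, Vectors.dense([1.0, 0.0, 3.0])),
        (1, Vectors.dense([0.0, 2.0, 1.0])),
    ])

    # The statistics reduced each iteration are roughly k x vocabSize,
    # so shrinking either dimension shrinks the reduce.
    model = LDA.train(corpus, k=2, maxIterations=10, optimizer="online")
    print(model.topicsMatrix())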

large scheduler delay

2016-04-18 Thread Darshan Singh
…each with 10 cores. The only application running on the cluster is mine. The overall performance is quite good, but there is a large scheduler delay, especially when it reads the smaller data frame from HDFS and in the final step when it uses the broadcast hash join. Usually compute time is just 50% of the…
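A hedged sketch of how that final broadcast hash join might look (the paths, column name, and explicit hint are assumptions, not from the thread). With a broadcast join, every executor must first fetch the small side, and that wait shows up as scheduler delay rather than compute time:

    from pyspark.sql import SQLContext
    from pyspark.sql.functions import broadcast

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

    small_df = sqlContext.read.parquet("hdfs:///path/to/small")  # hypothetical paths
    large_df = sqlContext.read.parquet("hdfs:///path/to/large")

    # Explicit broadcast hint; alternatively, lower
    # spark.sql.autoBroadcastJoinThreshold to fall back to a shuffle join
    # if shipping the small table is what dominates.
    joined = large_df.join(broadcast(small_df), "key")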

Re: large scheduler delay in pyspark

2015-08-05 Thread ayan guha
>> > if i not in a:
>> >     a.append(i)
>> > return a
>> > rdd.reduceByKey(f)
>>
>> Is it possible that you have a large object that is also named `i` or `a` or `b`?
>>
>> Btw, the second one could be slower than the first one…

Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
> …a.append(i)
>     return a
> rdd.reduceByKey(f)
>
> Is it possible that you have a large object that is also named `i` or `a` or `b`?
>
> Btw, the second one could be slower than the first one, because you try to look up an object in a list, which is O(N), especially…

Re: large scheduler delay in pyspark

2015-08-04 Thread Davies Liu
> …cially when the object is large (a dict). It will cause very large scheduler delay, about 15-20 mins. (The data I deal with is about 300 MB, and I use 5 machines with 32 GB memory.)

If you see scheduler delay, it means there may be a large broadcast involved.

> I know the second code is not…
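To illustrate Davies's point (the scenario below is assumed, not from the thread): in PySpark a large driver-side object captured in a task closure gets serialized and shipped to the executors, which the UI reports as scheduler delay. A mitigation sketch, assuming an existing `sc` and `rdd`:

    # Hypothetical large driver-side dict.
    big_lookup = {i: str(i) for i in range(1000000)}

    # Slow pattern: `big_lookup` is captured in the closure and shipped
    # with the tasks (as a large broadcast in PySpark):
    #   rdd.map(lambda x: big_lookup.get(x))

    # Explicit broadcast variable: shipped once per executor instead.
    bc = sc.broadcast(big_lookup)
    result = rdd.map(lambda x: bc.value.get(x))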

large scheduler delay in pyspark

2015-08-03 Thread gen tang
…reduceByKey as follows:

    rdd.reduceByKey(lambda a, b: a + b)

It works fine; scheduler delay is less than 10s. However, if I do reduceByKey:

    def f(a, b):
        for i in b:
            if i not in a:
                a.append(i)
        return a

    rdd.reduceByKey(f)

it will cause very large scheduler delay…
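A possible fix (my sketch, not from the thread), assuming the values are lists of hashable items: `f` scans a list for every element, so each merge is quadratic, while set union performs the same deduplication with O(1) amortized membership tests:

    # Convert each list value to a set once, then merge with set union;
    # this keeps the semantics of f but avoids the O(N) list scans.
    deduped = rdd.mapValues(set).reduceByKey(lambda a, b: a | b)

    # If list output is needed downstream:
    result = deduped.mapValues(list)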