Hi Yuhao,
Thank you so much for your great contribution to the LDA and other Spark
modules!
I use both Spark 1.6.2 and 2.0.0. The data set I originally used is very large,
with tens of millions of documents, but for testing purposes the data set I
mentioned earlier ("/data/mllib/sample_lda_data.txt") is good enough. Please
change the path below to point to the data set under your Spark installation
and run these lines:
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// please change the path for the data set below:
val data = sc.textFile("/data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
val ldaModel = new LDA().setK(3).run(corpus)

It should work. After that, please run:

val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)
When I ran it, job #90 took extremely long compared with the earlier
iterations, and then it stopped with an exception:
Active Jobs (1)
| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 90 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 22 s | 0/269 | 0/538 |

Completed Jobs (90)
| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 89 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 43 ms | 4/4 (262 skipped) | 8/8 (524 skipped) |
| 88 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 40 ms | 4/4 (259 skipped) | 8/8 (518 skipped) |
| 87 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 80 ms | 4/4 (256 skipped) | 8/8 (512 skipped) |
| 86 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 41 ms | 4/4 (253 skipped) | 8/8 (506 skipped) |
Part of the error message:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1046)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.fold(RDD.scala:1040)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:226)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:213)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:79)
  at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:334)
  ... 48 elided
Caused by: java.lang.StackOverflowError
  at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
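One thing I plan to try, in case the StackOverflowError comes from the long
RDD lineage built up over 500 EM iterations during serialization: enabling
checkpointing so the lineage gets truncated periodically. This is only a
guess on my part; the checkpoint directory path below is just an example.

sc.setCheckpointDir("/tmp/spark-checkpoints")  // example path; any HDFS/local dir works
val ldaModel = new LDA()
  .setK(3)
  .setMaxIterations(500)
  .setCheckpointInterval(10)  // checkpoint every 10 iterations to cut the lineage
  .run(corpus)

If you know whether that is the right direction, or whether this is a known
issue with EMLDAOptimizer at high iteration counts, I'd appreciate a pointer.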
Thank you so much!
Frank
From: "Yang, Yuhao" <yuhao.y...@intel.com>
To: Frank Zhang <dataminin...@yahoo.com>; "user@spark.apache.org"
<user@spark.apache.org>
Sent: Tuesday, September 20, 2016 9:49 AM
Subject: RE: LDA and Maximum Iterations