Hi Yuhao,
   Thank you so much for your great contributions to LDA and the other Spark modules!
   I use both Spark 1.6.2 and 2.0.0. My original data set is very large, with tens of millions of documents, but for testing purposes the data set I mentioned earlier ("/data/mllib/sample_lda_data.txt") is good enough. Please change the path below to point at that file under your Spark installation and run these lines:
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// please change the path for the data set below:
val data = sc.textFile("/data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
val ldaModel = new LDA().setK(3).run(corpus)

   It should work. After that, please run:

val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)
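
   (For what it's worth, one workaround I plan to try, untested so far, is to checkpoint the EM state every few iterations so the RDD lineage cannot grow without bound across the 500 iterations. The checkpoint directory below is only a placeholder:

sc.setCheckpointDir("/tmp/lda-checkpoints")  // placeholder path; any local/HDFS dir
val ldaModelCkpt = new LDA()
  .setK(3)
  .setMaxIterations(500)
  .setCheckpointInterval(10)  // checkpoint every 10 iterations; 10 is the default
  .run(corpus)

Checkpointing only takes effect once a checkpoint directory is set on the SparkContext.)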

   When I ran the 500-iteration line, job #90 took an unusually long time for a single iteration and then the run stopped with an exception:
Active Jobs (1)

| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 90 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 22 s | 0/269 | 0/538 |

Completed Jobs (90)

| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 89 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 43 ms | 4/4 (262 skipped) | 8/8 (524 skipped) |
| 88 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 40 ms | 4/4 (259 skipped) | 8/8 (518 skipped) |
| 87 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 80 ms | 4/4 (256 skipped) | 8/8 (512 skipped) |
| 86 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 41 ms | 4/4 (253 skipped) | 8/8 (506 skipped) |
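
   (Note that each completed job skips three more stages than the one before it: 253, 256, 259, then 262 skipped stages. I take that to mean each iteration carries a lineage three stages longer than the last.)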

   Part of the error message:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1046)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.fold(RDD.scala:1040)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:226)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:213)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:79)
  at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:334)
  ... 48 elided
Caused by: java.lang.StackOverflowError
  at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
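
   The "Caused by: java.lang.StackOverflowError" inside ObjectStreamClass.invokeReadObject makes me suspect the serialized task graph is simply too deep to deserialize, since each EM iteration extends the lineage. If the limit really is stack depth rather than heap, raising the executor thread stack size might also help. A sketch for a standalone app follows (the app name and the 16m value are only guesses, and a driver-side -Xss would have to be passed via --conf spark.driver.extraJavaOptions on the command line instead):

import org.apache.spark.{SparkConf, SparkContext}

// Give executor threads deeper stacks for deserializing long lineages.
val conf = new SparkConf()
  .setAppName("LDAMaxIterTest")                       // hypothetical app name
  .set("spark.executor.extraJavaOptions", "-Xss16m")  // 16m is a guess
val sc = new SparkContext(conf)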
   Thank you so much!
   Frank 


      From: "Yang, Yuhao" <yuhao.y...@intel.com>
 To: Frank Zhang <dataminin...@yahoo.com>; "user@spark.apache.org" 
<user@spark.apache.org> 
 Sent: Tuesday, September 20, 2016 9:49 AM
 Subject: RE: LDA and Maximum Iterations
  
Hi Frank,

   Which version of Spark are you using? Also, can you share more information about the exception?

   If it's not confidential, you can send a data sample to me (yuhao.y...@intel.com) and I can try to investigate.

   Regards,
   Yuhao

From: Frank Zhang [mailto:dataminin...@yahoo.com.INVALID]
Sent: Monday, September 19, 2016 9:20 PM
To: user@spark.apache.org
Subject: LDA and Maximum Iterations

Hi all,

   I have a question about parameter setting for the LDA model. When I try to set a large number like 500 for setMaxIterations, the program always fails. There is a very straightforward LDA tutorial using an example data set in the mllib package: http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala. The code is here:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might need to change the path for the data set
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

   But if I change the last line to

val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)

the program fails.

   I greatly appreciate your help!

   Best,

   Frank

   
