Re: LDA and Maximum Iterations

2016-09-20 Thread Frank Zhang
Hi Yuhao,
   Thank you so much for your great contributions to LDA and the other Spark 
modules!
    I use both Spark 1.6.2 and 2.0.0. The data I originally used is very large, 
with tens of millions of documents, but for testing purposes the data set I 
mentioned earlier ("/data/mllib/sample_lda_data.txt") is good enough. Please 
change the path to point to the data set under your Spark installation and 
run these lines:
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Please change the path for the data set below:
val data = sc.textFile("/data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
val ldaModel = new LDA().setK(3).run(corpus)

   It should work. After that, please run:

val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)
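
   As a sanity check between the two runs, you can confirm the first model 
trained by printing the top terms per topic with LDAModel.describeTopics. A 
small sketch (showing 5 terms per topic is an arbitrary choice):

// Print the top 5 term indices and their weights for each of the 3 topics
ldaModel.describeTopics(5).zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
  println(s"Topic $topicId: " +
    termIndices.zip(termWeights).map { case (t, w) => f"$t%d -> $w%.3f" }.mkString(", "))
}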

   When I ran it, job #90 took an extremely long time relative to the earlier 
jobs, and then it stopped with an exception:
Active Jobs (1)

| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 90 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 22 s | 0/269 | 0/538 |


Completed Jobs (90)

| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 89 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 43 ms | 4/4 (262 skipped) | 8/8 (524 skipped) |
| 88 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 40 ms | 4/4 (259 skipped) | 8/8 (518 skipped) |
| 87 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 80 ms | 4/4 (256 skipped) | 8/8 (512 skipped) |
| 86 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 41 ms | 4/4 (253 skipped) | 8/8 (506 skipped) |

   Part of the error message:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1046)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.fold(RDD.scala:1040)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:226)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:213)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:79)
  at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:334)
  ... 48 elided
Caused by: java.lang.StackOverflowError
  at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
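
   Given the "Caused by: java.lang.StackOverflowError" during deserialization, 
my guess is that the RDD lineage grows with every EM iteration until 
(de)serializing a task overflows the stack. One workaround I intend to try is 
enabling checkpointing so the lineage is truncated periodically. A minimal 
sketch (the checkpoint directory is just a placeholder path for my environment):

// Checkpointing truncates the RDD lineage so deep iteration counts
// do not overflow the stack during task serialization.
sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path
val ldaModel = new LDA()
  .setK(3)
  .setMaxIterations(500)
  .setCheckpointInterval(10) // write a checkpoint every 10 iterations
  .run(corpus)

   My understanding is that setCheckpointInterval makes the optimizer persist 
intermediate RDDs to stable storage, keeping the dependency chain short.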
   Thank you so much!
   Frank 


  From: "Yang, Yuhao" <yuhao.y...@intel.com>
 To: Frank Zhang <dataminin...@yahoo.com>; "user@spark.apache.org" 
<user@spark.apache.org> 
 Sent: Tuesday, September 20, 2016 9:49 AM
 Subject: RE: LDA and Maximum Iterations
  

LDA and Maximum Iterations

2016-09-19 Thread Frank Zhang
Hi all,
   I have a question about parameter settings for the LDA model. When I set a 
large number like 500 for setMaxIterations, the program always fails. There 
is a very straightforward LDA tutorial using an example data set in the mllib 
package: 
http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala. 
The code is here:
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might need to change the path for the data set
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

But if I change the last line to

val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)

the program fails.

    I greatly appreciate your help! 
Best,
    Frank