[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-06-29 Thread Asher Krim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068988#comment-16068988
 ] 

Asher Krim commented on SPARK-20797:


This looks like a duplicate of 
https://issues.apache.org/jira/browse/SPARK-19294? 

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model on large text data (nearly 1 billion Chinese
> news abstracts), the training step goes well, but the save step fails.
> Something like the following happens (e.g. on 1.6.1):
> Problem 1: the serialized data is bigger than spark.kryoserializer.buffer.max.
> (Raising that parameter works around problem 1, but then leads to problem 2.)
> Problem 2: the data exceeds spark.akka.frameSize. (Raising this parameter too
> far then fails with out of memory and the job is killed; on versions > 2.0.0
> the error is "exceeds max allowed: spark.rpc.message.maxSize".) A configuration
> sketch for these two settings follows after this quoted description.
> The problem appears when the topic count is large (k=200 is ok, but k=300
> fails) and the vocabulary size is also large (nearly 1,000,000).
> I found that Word2Vec's save function is similar to LocalLDAModel's save
> function:
> Word2Vec's problem (saving with repartition(1)) was already fixed in
> https://github.com/apache/spark/pull/9989, but LocalLDAModel still uses
> repartition(1), i.e. a single partition when saving.
> Word2Vec's save method in the latest code
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala):
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> But in mllib.clustering.LDAModel, LocalLDAModel's save
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala)
> still does:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
>     Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
>   }
>   spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Following Word2Vec's save (repartition(nPartitions)), I replaced numWords with
> the topic count k and used repartition(nPartitions) in LocalLDAModel's save
> method, recompiled the code, and deployed the patched LDA job with the large
> dataset on our cluster; it works (see the sketch after this quoted description).
> I hope this will be fixed in the next version.
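For anyone hitting the same errors before a fix lands, the two settings named in problems 1 and 2 can be raised when the job's SparkConf is built. This is only the stopgap the report describes (it still ends in out of memory once k and the vocabulary grow large enough), and the values below are illustrative assumptions rather than recommendations:

  import org.apache.spark.SparkConf

  // Stopgap only: raise the Kryo serializer buffer and the RPC message limit.
  // spark.rpc.message.maxSize is in MB and replaces spark.akka.frameSize on 2.0+;
  // the values here are placeholders, not tuned settings.
  val conf = new SparkConf()
    .set("spark.kryoserializer.buffer.max", "1g")
    .set("spark.rpc.message.maxSize", "512")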
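The change the report describes amounts to computing a partition count for the topics DataFrame the way Word2Vec's save does, instead of hard-coding repartition(1). The following is a minimal sketch of what that could look like inside LocalLDAModel's save, reusing the names from the quoted snippet (spark, topicsMatrix, k, Data, Loader, path) and assuming the same Utils.byteStringAsBytes helper that Word2Vec's save uses to read the buffer size; the actual proposed change is in https://github.com/apache/spark/pull/18034:

  // Size the output partitions so each stays under spark.kryoserializer.buffer.max,
  // mirroring Word2Vec's save instead of forcing the whole matrix into one partition.
  val bufferSize = Utils.byteStringAsBytes(
    spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
  // Rough payload estimate: k topic rows, each a dense vector of vocabSize doubles.
  val vocabSize = topicsMatrix.numRows
  val approxSize = 8L * vocabSize * k
  val nPartitions = ((approxSize / bufferSize) + 1).toInt
  val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
  val topics = Range(0, k).map { topicInd =>
    Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
  }
  spark.createDataFrame(topics)
    .repartition(nPartitions)
    .write.parquet(Loader.dataPath(path))

With a partition count derived from the model size, the save step no longer funnels the entire topics matrix through a single task, which is what triggered the buffer and frame-size errors above.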






[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-19 Thread d0evi1 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017128#comment-16017128
 ] 

d0evi1 commented on SPARK-20797:


OK, here it is: https://github.com/apache/spark/pull/18034







[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017127#comment-16017127
 ] 

Apache Spark commented on SPARK-20797:
--

User 'd0evi1' has created a pull request for this issue:
https://github.com/apache/spark/pull/18034







[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-18 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016061#comment-16016061
 ] 

yuhao yang commented on SPARK-20797:


[~d0evi1] Thanks for reporting the issue and proposing a fix. Would you like to 
send a PR for it? 



