[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16464





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96144837
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib_clustering.R ---
@@ -146,12 +146,16 @@ test_that("spark.lda with libsvm", {
   topics <- stats$topicTopTerms
   weights <- stats$topicTopTermsWeights
   vocabulary <- stats$vocabulary
+  trainingLogLikelihood <- stats$trainingLogLikelihood
+  logPrior <- stats$logPrior
 
-  expect_false(isDistributed)
+  expect_true(isDistributed)
   expect_true(logLikelihood <= 0 & is.finite(logLikelihood))
   expect_true(logPerplexity >= 0 & is.finite(logPerplexity))
   expect_equal(vocabSize, 11)
   expect_true(is.null(vocabulary))
+  expect_true(trainingLogLikelihood <= 0 & !is.nan(trainingLogLikelihood))
+  expect_true(logPrior <= 0 & !is.nan(logPrior))
--- End diff --

shouldn't these two tests be `!is.na`?
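For context: in R, `is.na()` is TRUE for both `NA` and `NaN`, while `is.nan()` is TRUE only for `NaN`, so `!is.na` also guards against an `NA` result. A minimal sketch, assuming a stock R session:
```
is.na(NA)    # TRUE
is.na(NaN)   # TRUE  -- is.na() treats NaN as missing too
is.nan(NA)   # FALSE -- so !is.nan(NA) is TRUE and would not catch an NA result
is.nan(NaN)  # TRUE
```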





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-14 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96125283
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

@felixcheung I will update it to NA. Thanks!





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96103386
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

another data point is that `NA` fits in as just another element in a vector or list in R, but NULL does not:
```
> as.data.frame(list(1, 2, 2))
  X1 X2 X2.1
1  1  2    2
> as.data.frame(list(1, NULL, 2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  arguments imply differing number of rows: 1, 0
> as.data.frame(list(1, NA, 2))
  X1 NA. X2
1  1  NA  2
```

In Spark SQL, we convert NULL from JVM to NA in DataFrame
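The same distinction holds for plain vectors and for list element assignment; a minimal sketch, assuming a stock R session:
```
length(c(1, NULL, 2))  # 2 -- NULL simply disappears from the vector
length(c(1, NA, 2))    # 3 -- NA is kept as a real (missing) element

l <- list(a = 1, b = 2)
l$b <- NULL            # assigning NULL *removes* the element
names(l)               # "a"
```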





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96049147
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

They look very similar. I am fine with any of the two. Thanks!





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96049008
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

When returning `NA`:
```
> stats
$docConcentration
 [1] 0.09116572 0.13013764 0.09116557 0.09116559 0.09116584 0.09116617
 [7] 0.09116576 0.09116573 0.09116569 0.09116564

$topicConcentration
[1] 0.1

$logLikelihood
[1] -392.9573

$logPerplexity
[1] 2.976949

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array<string>, termWeights:array<double>]

$vocabulary
 [1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
[1] NA

$logPrior
[1] NA
```





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96047751
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

When returning `NULL`:
```
> stats
$docConcentration
 [1] 0.09117984 0.09117993 0.09117964 0.09117969 0.09117991 0.13010149
 [7] 0.09117986 0.09117983 0.09192744 0.09117969

$topicConcentration
[1] 0.1

$logLikelihood
[1] -395.4505

$logPerplexity
[1] 2.995837

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array<string>, termWeights:array<double>]

$vocabulary
 [1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
NULL

$logPrior
NULL
```





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96035584
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

My take is that the convention in R for missing values in a structure (list, data.frame) is `NA`.
And since we are returning a list, we could also simply omit the entry if the value is not applicable or available (i.e., don't add `trainingLogLikelihood` to the list at L421) -- see the sketch below.

In any case, what does it look like when printing the `summary` returned here for the values (NULL, NA) we are proposing?
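A rough sketch of the omission approach, hypothetical but mirroring the list construction in the diffs above (`jobj` and the other fields as in `summary`):
```
# Build the summary list from the fields common to both model types
res <- list(docConcentration = docConcentration,
            topicConcentration = topicConcentration,
            logLikelihood = logLikelihood,
            logPerplexity = logPerplexity,
            isDistributed = isDistributed,
            vocabSize = vocabSize,
            topics = topics,
            vocabulary = vocabulary)
# Only add the distributed-only fields when they apply;
# otherwise they are simply absent from the returned list.
if (isDistributed) {
  res$trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
  res$logPrior <- callJMethod(jobj, "logPrior")
}
res
```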






[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r95954100
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

In MLlib, if the model is a `DistributedLDAModel`, it has members called `trainingLogLikelihood` and `logPrior`. If the model is a `LocalLDAModel`, those members do not exist, so we cannot tell users whether they are numeric. So I'd prefer `NULL` to represent "not existing".
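The choice also changes how callers test the field; a hypothetical sketch of the two calling patterns:
```
stats <- summary(model)

# With NULL ("not existing"), callers test for existence:
if (!is.null(stats$logPrior)) print(stats$logPrior)

# With NA ("missing value"), callers test for missingness instead:
if (!is.na(stats$trainingLogLikelihood)) print(stats$trainingLogLikelihood)
```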





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r95951488
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

I'd prefer `NA` rather than `NULL`.
If it's numeric, I think `NaN` is appropriate, no?





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r95948100
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

I think it's more appropriate to return `NULL` rather than `NaN` for a local LDA model, since `logPrior` does not exist, rather than being not-a-number.
BTW, I think we can return `NULL` directly based on `isDistributed`, and only call the corresponding Scala methods otherwise. This should reduce the complexity of `LDAWrapper` and the communication between R and Scala.
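A rough sketch of that gating on the R side, assuming the `summary` method from the diffs above:
```
isDistributed <- callJMethod(jobj, "isDistributed")
if (isDistributed) {
  trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
  logPrior <- callJMethod(jobj, "logPrior")
} else {
  # LocalLDAModel: the members don't exist on the Scala side,
  # so skip the JVM round-trips entirely
  trainingLogLikelihood <- NULL
  logPrior <- NULL
}
```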





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r95948289
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -388,6 +388,13 @@ setMethod("spark.lda", signature(data = "SparkDataFrame"),
 #' \item{\code{topics}}{top 10 terms and their weights of all topics}
 #' \item{\code{vocabulary}}{whole terms of the training corpus, NULL if libsvm format file
 #'   used as training set}
+#' \item{\code{trainingLogLikelihood}}{Log likelihood of the observed tokens in the training set,
+#'   given the current parameter estimates:
+#'   log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
+#'   It is only for \code{DistributedLDAModel} (i.e., optimizer = "em")}
--- End diff --

`\code{DistributedLDAModel}` should be converted to a plain-text description, since there is no class called `DistributedLDAModel` in SparkR.





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-10 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r95436115
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -45,6 +45,11 @@ private[r] class LDAWrapper private (
   import LDAWrapper._
 
   private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
+  private val distributedMoel = lda.isDistributed match {
--- End diff --

Fixed. Thanks!





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r95434408
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -45,6 +45,11 @@ private[r] class LDAWrapper private (
   import LDAWrapper._
 
   private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
+  private val distributedMoel = lda.isDistributed match {
--- End diff --

`distributedModel`?





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94984564
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

logPrior is calculated from the serialized topics etc., which are also used by the trainingLogLikelihood. But the trainingLogLikelihood is the same for both the original and the loaded model. Let me debug more; it looks like a bug. The original MLlib implementation doesn't serialize the two values, as they can be recomputed from other saved values. In addition, there is no unit test comparing the two values, which could be why this issue was not caught.





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94983934
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -123,6 +126,10 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
   .setMaxIter(maxIter)
   .setSubsamplingRate(subsamplingRate)
 
+if (optimizer == "em") {
--- End diff --

If it is "online", it is not necessary to set the optimizer. But setting it 
anyway will make the code clean. I will do it.





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94905455
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

that's odd, is that an issue with model persistence?






[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94905327
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -907,6 +917,8 @@ test_that("spark.lda with text input", {
   expect_equal(logPerplexity, stats2$logPerplexity)
   expect_equal(vocabSize, stats2$vocabSize)
   expect_true(all.equal(vocabulary, stats2$vocabulary))
+  expect_equal(trainingLogLikelihood, stats2$trainingLogLikelihood)
+  expect_equal(logPrior, stats2$logPrior)
--- End diff --

can you add a test for when these are not `NaN`? like here:
https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala#L239





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-05 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94905196
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -123,6 +126,10 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
   .setMaxIter(maxIter)
   .setSubsamplingRate(subsamplingRate)
 
+if (optimizer == "em") {
--- End diff --

shouldn't it just set `optimizer` - without having to check if it is `em`?





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94866311
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

With the same dataset, the Scala-side tests give:
Original logPrior: -3.3387459952856338
logPrior from saved model: -0.9202435107654922






[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-04 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94721880
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

In the first version, I got them from the model in the `LDAWrapper` class. However, when I read `logPrior` back, I found that the loaded `logPrior` is not the same as the value before saving. So I followed the approach used for logLikelihood and logPerplexity and saved it in the metadata.





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94716683
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

since the model is referenced and persisted, is there a need to handle trainingLogLikelihood and logPrior separately like this, writing them to metadata, instead of just getting them from the model when fetching the summary?





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-03 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16464

[SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer correctly

## What changes were proposed in this pull request?

spark.lda passes the optimizer ("em" or "online") as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE when it should be TRUE according to the Scala code.

In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.

## How was this patch tested?
Manual tests, comparing against the Scala example.
Modified the current unit tests: fixed the incorrect unit test and added the necessary tests for the `summary` method.
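
For reference, a minimal SparkR snippet of the behavior this change targets (hypothetical data path; field names as in the `summary` diffs above):
```
df <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(df, k = 5, optimizer = "em")
stats <- summary(model)
stats$isDistributed           # should now be TRUE for optimizer = "em"
stats$trainingLogLikelihood   # populated only for the distributed model
stats$logPrior
```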

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark new

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16464.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16464


commit 277763614934aef2625274e2e55e7efe6eb4c2df
Author: wm...@hotmail.com 
Date:   2017-01-03T23:26:28Z

fix the optimizer bug



