[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16464

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r96144837

Diff: R/pkg/inst/tests/testthat/test_mllib_clustering.R

```
@@ -146,12 +146,16 @@ test_that("spark.lda with libsvm", {
     topics <- stats$topicTopTerms
     weights <- stats$topicTopTermsWeights
     vocabulary <- stats$vocabulary
+    trainingLogLikelihood <- stats$trainingLogLikelihood
+    logPrior <- stats$logPrior

-    expect_false(isDistributed)
+    expect_true(isDistributed)
     expect_true(logLikelihood <= 0 & is.finite(logLikelihood))
     expect_true(logPerplexity >= 0 & is.finite(logPerplexity))
     expect_equal(vocabSize, 11)
     expect_true(is.null(vocabulary))
+    expect_true(trainingLogLikelihood <= 0 & !is.nan(trainingLogLikelihood))
+    expect_true(logPrior <= 0 & !is.nan(logPrior))
```

shouldn't these two tests be `!is.na`?
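A quick illustration of the reviewer's point (a hypothetical snippet, not part of the PR): in base R, `is.na()` is TRUE for both `NA` and `NaN`, while `is.nan()` is FALSE for `NA`. So if the summary returns `NA` when the model is not distributed, a `!is.nan` guard never fires on it:

```r
# NA vs NaN in base R:
is.na(NA)    # TRUE
is.na(NaN)   # TRUE  -- is.na() covers NaN as well
is.nan(NA)   # FALSE -- is.nan() does NOT cover NA
is.nan(NaN)  # TRUE

# For x <- NA, the expression used in the test,
#   x <= 0 & !is.nan(x)
# evaluates to NA (not FALSE), whereas
#   x <= 0 & !is.na(x)
# evaluates to FALSE, which is the cleaner failure.
```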
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r96125283

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

@felixcheung I will update it to NA. Thanks!
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r96103386

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

another data point is that `NA` fits in as just another element in a vector or list in R, but NULL does not:

```
> as.data.frame(list(1, 2, 2))
  X1 X2 X2.1
1  1  2    2
> as.data.frame(list(1, NULL, 2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  arguments imply differing number of rows: 1, 0
> as.data.frame(list(1, NA, 2))
  X1 NA. X2
1  1  NA  2
```

In Spark SQL, we convert NULL from the JVM to NA in a DataFrame.
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r96049147

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

They look very similar. I am fine with either of the two. Thanks!
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r96049008

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

When returning NA:

```
> stats
$docConcentration
 [1] 0.09116572 0.13013764 0.09116557 0.09116559 0.09116584 0.09116617
 [7] 0.09116576 0.09116573 0.09116569 0.09116564

$topicConcentration
[1] 0.1

$logLikelihood
[1] -392.9573

$logPerplexity
[1] 2.976949

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array, termWeights:array]

$vocabulary
 [1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
[1] NA

$logPrior
[1] NA
```
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r96047751

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

When returning NULL:

```
> stats
$docConcentration
 [1] 0.09117984 0.09117993 0.09117964 0.09117969 0.09117991 0.13010149
 [7] 0.09117986 0.09117983 0.09192744 0.09117969

$topicConcentration
[1] 0.1

$logLikelihood
[1] -395.4505

$logPerplexity
[1] 2.995837

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array, termWeights:array]

$vocabulary
 [1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
NULL

$logPrior
NULL
```
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r96035584

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

My take is that the convention in R for missing values in a structure (list, data.frame) is `NA`. And since we are returning a list, we could also simply omit the entry if the value is not applicable or available (i.e., don't add `trainingLogLikelihood` to the list in L421). In any case, what does it look like when printing the `summary` returned here for the values (NULL, NA) we are proposing?
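A sketch of the "omit it" option (hypothetical code, assuming the summary is assembled as a plain named list rather than the actual SparkR implementation): an entry appended conditionally simply never appears, and assigning NULL removes an existing one:

```r
# Build a summary-like list, adding the distributed-only field conditionally.
# isDistributed and trainingLogLikelihood are stand-in values here.
isDistributed <- FALSE
trainingLogLikelihood <- -392.96

summary <- list(vocabSize = 10, isDistributed = isDistributed)
if (isDistributed) {
  summary$trainingLogLikelihood <- trainingLogLikelihood
}
names(summary)   # "vocabSize" "isDistributed" -- no trainingLogLikelihood entry

# Assigning NULL to an existing element removes it outright:
summary$vocabSize <- NULL
names(summary)   # "isDistributed"
```

This is why NULL is awkward as a stored "missing" marker in a list: the assignment deletes the slot instead of recording an absent value, whereas `NA` stays in place.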
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r95954100

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

In MLlib, if the model is `DistributedLDAModel`, it has members called `trainingLogLikelihood` and `logPrior`. If the model is `LocalLDAModel`, those members do not exist, so we cannot tell users whether they are numeric. So I'd prefer `NULL` to represent non-existence.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r95951488

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

I'd prefer NA rather than NULL. If it's numeric, I think NaN is appropriate, no?
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r95948100

Diff: R/pkg/R/mllib_clustering.R

```
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
     vocabSize <- callJMethod(jobj, "vocabSize")
     topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
     vocabulary <- callJMethod(jobj, "vocabulary")
+    trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
+    logPrior <- callJMethod(jobj, "logPrior")
```

I think it's more appropriate to return `NULL` rather than `NaN` for a local LDA model, since `logPrior` does not exist at all, rather than being "not a number". BTW, I think we can return NULL directly based on `isDistributed`, and otherwise call the corresponding Scala methods. This would reduce the complexity of `LDAWrapper` and reduce communication between R and Scala.
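A minimal sketch of that suggestion on the R side (all names here are hypothetical stand-ins; the real code would call into the JVM via `callJMethod`): branch on `isDistributed` before making the extra calls, so a local model never round-trips to Scala for these values:

```r
# Hypothetical helper: only query the distributed-only metrics when they exist.
# `fetch` stands in for a function like function(name) callJMethod(jobj, name).
getTrainingMetrics <- function(isDistributed, fetch) {
  if (isDistributed) {
    list(trainingLogLikelihood = fetch("trainingLogLikelihood"),
         logPrior = fetch("logPrior"))
  } else {
    list(trainingLogLikelihood = NULL, logPrior = NULL)
  }
}

# A stand-in for the JVM call, returning made-up values:
fakeFetch <- function(name) if (name == "logPrior") -3.34 else -392.96

getTrainingMetrics(FALSE, fakeFetch)$logPrior   # NULL -- no JVM call made
getTrainingMetrics(TRUE, fakeFetch)$logPrior    # -3.34
```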
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r95948289

Diff: R/pkg/R/mllib_clustering.R

```
@@ -388,6 +388,13 @@ setMethod("spark.lda", signature(data = "SparkDataFrame"),
 #' \item{\code{topics}}{top 10 terms and their weights of all topics}
 #' \item{\code{vocabulary}}{whole terms of the training corpus, NULL if libsvm format file
 #'         used as training set}
+#' \item{\code{trainingLogLikelihood}}{Log likelihood of the observed tokens in the training set,
+#'         given the current parameter estimates:
+#'         log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters)
+#'         It is only for \code{DistributedLDAModel} (i.e., optimizer = "em")}
```

`\code{DistributedLDAModel}` should be converted to a text description, since there is no class called `DistributedLDAModel` in SparkR.
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r95436115

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -45,6 +45,11 @@ private[r] class LDAWrapper private (
   import LDAWrapper._

   private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
+  private val distributedMoel = lda.isDistributed match {
```

Fixed. Thanks!
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r95434408

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -45,6 +45,11 @@ private[r] class LDAWrapper private (
   import LDAWrapper._

   private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
+  private val distributedMoel = lda.isDistributed match {
```

`distributedModel`?
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94984564

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
       model,
       ldaModel.logLikelihood(preprocessedData),
       ldaModel.logPerplexity(preprocessedData),
+      trainingLogLikelihood,
+      logPrior,
```

`logPrior` is calculated from the serialized topics etc., which are also used by `trainingLogLikelihood`. But `trainingLogLikelihood` is the same for both the original and the loaded model. Let me debug more; it looks like a bug. The original MLlib implementation doesn't serialize the two values, as they can be recomputed from other saved values. In addition, there is no unit test comparing the two values, which could be why this issue was not caught.
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94983934

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -123,6 +126,10 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
       .setMaxIter(maxIter)
       .setSubsamplingRate(subsamplingRate)

+    if (optimizer == "em") {
```

If it is "online", it is not necessary to set the optimizer. But setting it unconditionally makes the code cleaner. I will do it.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94905455

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
       model,
       ldaModel.logLikelihood(preprocessedData),
       ldaModel.logPerplexity(preprocessedData),
+      trainingLogLikelihood,
+      logPrior,
```

that's odd, is that an issue with model persistence?
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94905327

Diff: R/pkg/inst/tests/testthat/test_mllib.R

```
@@ -907,6 +917,8 @@ test_that("spark.lda with text input", {
   expect_equal(logPerplexity, stats2$logPerplexity)
   expect_equal(vocabSize, stats2$vocabSize)
   expect_true(all.equal(vocabulary, stats2$vocabulary))
+  expect_equal(trainingLogLikelihood, stats2$trainingLogLikelihood)
+  expect_equal(logPrior, stats2$logPrior)
```

can you add a test for when these are not `nan`? like here: https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala#L239
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94905196

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -123,6 +126,10 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
       .setMaxIter(maxIter)
       .setSubsamplingRate(subsamplingRate)

+    if (optimizer == "em") {
```

shouldn't it just set `optimizer`, without having to check whether it is `em`?
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94866311

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
       model,
       ldaModel.logLikelihood(preprocessedData),
       ldaModel.logPerplexity(preprocessedData),
+      trainingLogLikelihood,
+      logPrior,
```

With the same dataset, Scala-side tests give:

    Original logPrior:              -3.3387459952856338
    logPrior from the saved model:  -0.9202435107654922
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94721880

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
       model,
       ldaModel.logLikelihood(preprocessedData),
       ldaModel.logPerplexity(preprocessedData),
+      trainingLogLikelihood,
+      logPrior,
```

In the first version, I got them from the model in the `LDAWrapper` class. However, when I read `logPrior`, I found that the loaded `logPrior` was not the same as the value before saving. So I followed `logLikelihood` and `logPerplexity` and saved it in the metadata.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94716683

Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala

```
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] {
       model,
       ldaModel.logLikelihood(preprocessedData),
       ldaModel.logPerplexity(preprocessedData),
+      trainingLogLikelihood,
+      logPrior,
```

since the model is referenced and persisted, is there a need to handle `trainingLogLikelihood` and `logPrior` separately like this, writing them to metadata, instead of just getting them from the model when fetching the summary?
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/16464

[SPARK-19066][SparkR]: SparkR LDA doesn't set optimizer correctly

## What changes were proposed in this pull request?

spark.lda passes the optimizer ("em" or "online") as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. As a result, for optimizer "em" the `isDistributed` field is FALSE when it should be TRUE according to the Scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.

## How was this patch tested?

Manual tests comparing against the Scala example. Modified the current unit tests: fixed the incorrect unit test and added the necessary tests for the `summary` method.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangmiao1981/spark new

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16464.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16464

commit 277763614934aef2625274e2e55e7efe6eb4c2df
Author: wm...@hotmail.com
Date: 2017-01-03T23:26:28Z

    fix the optimizer bug