[GitHub] spark issue #12675: [SPARK-14894][PySpark] Add result summary api to Gaussia...

2017-02-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/12675
  
@GayathriMurali If you are not able to proceed, I can take over. Thanks!





[GitHub] spark issue #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16800
  
Close to trigger the Windows test.





[GitHub] spark issue #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16800
  
Reopen to trigger the test.





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-09 Thread wangmiao1981
GitHub user wangmiao1981 reopened a pull request:

https://github.com/apache/spark/pull/16800

[SPARK-19456][SparkR]:Add LinearSVC R API

## What changes were proposed in this pull request?

The linear SVM classifier has been newly added to ML, and a Python API has been added.
This JIRA adds the R-side API.

Marked as WIP, as I am designing unit tests.
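
For context, a minimal sketch of how the proposed API is expected to be used from SparkR, mirroring the roxygen example added in this patch (the dataset and regParam value are illustrative only):

    library(SparkR)
    sparkR.session()

    # binary classification on two iris species (illustrative data)
    df <- createDataFrame(iris)
    training <- df[df$Species %in% c("versicolor", "virginica"), ]

    # fit a linear SVM, inspect it, predict, and persist the model
    model <- spark.svmLinear(training, Species ~ ., regParam = 0.5)
    summary(model)
    fitted <- predict(model, training)
    write.ml(model, "path/to/model")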

## How was this patch tested?

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark svc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16800.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16800


commit 90b5203d0048879108f6b637ec182f2ef29e769b
Author: wm...@hotmail.com 
Date:   2017-02-03T22:03:12Z

initial check in

commit a4dceec9debb14eae5c7040c58dd7988aa9a67c5
Author: wm...@hotmail.com 
Date:   2017-02-04T01:21:07Z

start test

commit 5b4c406c96a0e5835430af2368ab627ec0c3089d
Author: wm...@hotmail.com 
Date:   2017-02-04T01:29:18Z

format fix

commit e6eea1df1acd3d01ea3c989a6cfe0823a4f99fb2
Author: wm...@hotmail.com 
Date:   2017-02-07T06:56:48Z

add unit test

commit 35e669e28eeda175cd064a8021725d28d7f79fa7
Author: wm...@hotmail.com 
Date:   2017-02-07T18:45:52Z

fix warning

commit e6b6df81f2e40b4d7871e835f67e2663c9ae13f4
Author: wm...@hotmail.com 
Date:   2017-02-08T22:42:16Z

rename to spark.svmLinear

commit a1801261335cf08fc87f17d76979fda6fdcc1969
Author: wm...@hotmail.com 
Date:   2017-02-08T22:43:47Z

namespace rename

commit bbc72e1b02d6b1125cd125986beb5236a30b9d7d
Author: wm...@hotmail.com 
Date:   2017-02-09T00:01:51Z

fix unit test







[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-09 Thread wangmiao1981
Github user wangmiao1981 closed the pull request at:

https://github.com/apache/spark/pull/16800





[GitHub] spark issue #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16800
  
@felixcheung I have addressed the comments. cc @yanboliang  @hhbyyh  Thanks!





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100710324
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -39,6 +46,116 @@ setClass("MultilayerPerceptronClassificationModel", 
representation(jobj = "jobj"
 #' @note NaiveBayesModel since 2.0.0
 setClass("NaiveBayesModel", representation(jobj = "jobj"))
 
+#' linear SVM Model
+#'
+#' Fits an linear SVM model against a SparkDataFrame. It is a binary 
classifier, similar to svm in glmnet package
+#' Users can print, make predictions on the produced model and save the 
model to the input path.
+#'
+#' @param data SparkDataFrame for training.
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param regParam The regularization parameter.
+#' @param maxIter Maximum iteration number.
+#' @param tol Convergence tolerance of iterations.
+#' @param standardization Whether to standardize the training features 
before fitting the model. The coefficients
+#'of models will be always returned on the 
original scale, so it will be transparent for
+#'users. Note that with/without standardization, 
the models should be always converged
+#'to the same solution when no regularization is 
applied. Default is TRUE, same as glmnet.
+#' @param threshold The threshold in binary classification, in range [0, 
1].
+#' @param weightCol The weight column name.
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.svmLinear} returns a fitted linear SVM model.
+#' @rdname spark.svmLinear
+#' @aliases spark.svmLinear,SparkDataFrame,formula-method
+#' @name spark.svmLinear
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' df <- createDataFrame(iris)
+#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
+#' model <- spark.svmLinear(training, Species ~ ., regParam = 0.5)
+#' summary <- summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, training)
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and predict
+#' # Note that summary deos not work on loaded model
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.svmLinear since 2.2.0
+setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = 
"formula"),
+  function(data, formula, regParam = 0.0, maxIter = 100, tol = 
1E-6, standardization = TRUE,
+   threshold = 0.5, weightCol = NULL) {
--- End diff --

Yes





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100710345
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -39,6 +46,116 @@ setClass("MultilayerPerceptronClassificationModel", 
representation(jobj = "jobj"
 #' @note NaiveBayesModel since 2.0.0
 setClass("NaiveBayesModel", representation(jobj = "jobj"))
 
+#' linear SVM Model
+#'
+#' Fits an linear SVM model against a SparkDataFrame. It is a binary 
classifier, similar to svm in glmnet package
+#' Users can print, make predictions on the produced model and save the 
model to the input path.
+#'
+#' @param data SparkDataFrame for training.
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param regParam The regularization parameter.
+#' @param maxIter Maximum iteration number.
+#' @param tol Convergence tolerance of iterations.
+#' @param standardization Whether to standardize the training features 
before fitting the model. The coefficients
+#'of models will be always returned on the 
original scale, so it will be transparent for
+#'users. Note that with/without standardization, 
the models should be always converged
+#'to the same solution when no regularization is 
applied. Default is TRUE, same as glmnet.
+#' @param threshold The threshold in binary classification, in range [0, 
1].
+#' @param weightCol The weight column name.
+#' @param ... additional arguments passed to the method.
--- End diff --

In other algorithms, we don't expose `aggregationDepth`. Shall I add it in 
another PR?





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100710369
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -39,6 +46,116 @@ setClass("MultilayerPerceptronClassificationModel", 
representation(jobj = "jobj"
 #' @note NaiveBayesModel since 2.0.0
 setClass("NaiveBayesModel", representation(jobj = "jobj"))
 
+#' linear SVM Model
+#'
+#' Fits an linear SVM model against a SparkDataFrame. It is a binary 
classifier, similar to svm in glmnet package
+#' Users can print, make predictions on the produced model and save the 
model to the input path.
+#'
+#' @param data SparkDataFrame for training.
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param regParam The regularization parameter.
+#' @param maxIter Maximum iteration number.
+#' @param tol Convergence tolerance of iterations.
+#' @param standardization Whether to standardize the training features 
before fitting the model. The coefficients
+#'of models will be always returned on the 
original scale, so it will be transparent for
+#'users. Note that with/without standardization, 
the models should be always converged
+#'to the same solution when no regularization is 
applied. Default is TRUE, same as glmnet.
--- End diff --

Let me check.





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100710403
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -39,6 +46,116 @@ setClass("MultilayerPerceptronClassificationModel", 
representation(jobj = "jobj"
 #' @note NaiveBayesModel since 2.0.0
 setClass("NaiveBayesModel", representation(jobj = "jobj"))
 
+#' linear SVM Model
+#'
+#' Fits an linear SVM model against a SparkDataFrame. It is a binary 
classifier, similar to svm in glmnet package
+#' Users can print, make predictions on the produced model and save the 
model to the input path.
+#'
+#' @param data SparkDataFrame for training.
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param regParam The regularization parameter.
+#' @param maxIter Maximum iteration number.
+#' @param tol Convergence tolerance of iterations.
+#' @param standardization Whether to standardize the training features 
before fitting the model. The coefficients
+#'of models will be always returned on the 
original scale, so it will be transparent for
+#'users. Note that with/without standardization, 
the models should be always converged
+#'to the same solution when no regularization is 
applied. Default is TRUE, same as glmnet.
+#' @param threshold The threshold in binary classification, in range [0, 
1].
+#' @param weightCol The weight column name.
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.svmLinear} returns a fitted linear SVM model.
+#' @rdname spark.svmLinear
+#' @aliases spark.svmLinear,SparkDataFrame,formula-method
+#' @name spark.svmLinear
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' df <- createDataFrame(iris)
+#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
+#' model <- spark.svmLinear(training, Species ~ ., regParam = 0.5)
+#' summary <- summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, training)
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and predict
+#' # Note that summary deos not work on loaded model
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.svmLinear since 2.2.0
+setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = 
"formula"),
+  function(data, formula, regParam = 0.0, maxIter = 100, tol = 
1E-6, standardization = TRUE,
+   threshold = 0.5, weightCol = NULL) {
+formula <- paste(deparse(formula), collapse = "")
+
+if (is.null(weightCol)) {
+  weightCol <- ""
--- End diff --

I think null is better. There are several places like this. Let me double 
check. Then, I will fix them all in another PR. Thanks!
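
As a rough sketch of the direction discussed here (an illustrative helper only, not the actual SparkR wrapper code), NULL would stay the marker for "no weight column" and an empty string would be rejected rather than used as a sentinel:

    resolveWeightCol <- function(weightCol) {
      # NULL means "no weight column is used"
      if (is.null(weightCol)) {
        return(NULL)
      }
      # reject "" instead of treating it as a magic value
      if (!is.character(weightCol) || nchar(weightCol) == 0) {
        stop("weightCol should be a non-empty character string or NULL")
      }
      weightCol
    }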





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100710453
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LinearSVCWrapper.scala 
---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.classification.{LinearSVC, LinearSVCModel}
+import org.apache.spark.ml.feature.{IndexToString, RFormula}
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class LinearSVCWrapper private (
+val pipeline: PipelineModel,
+val features: Array[String],
+val labels: Array[String]) extends MLWritable {
+  import LinearSVCWrapper._
+
+  private val svcModel: LinearSVCModel =
+pipeline.stages(1).asInstanceOf[LinearSVCModel]
+
+  lazy val coefficients: Array[Double] = svcModel.coefficients.toArray
--- End diff --

I think this one is different. We just want to return it as an Array (a list in R).
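
On the R side the intent is roughly the following, a sketch that reuses the training SparkDataFrame from the roxygen example above and assumes the summary list exposes the coefficients under a coefficients field, as the wrapper's Array[Double] suggests:

    model <- spark.svmLinear(training, Species ~ ., regParam = 0.5)

    # the Scala Array[Double] arrives in R as a plain numeric vector that can
    # be labelled with the feature names
    coefs <- summary(model)$coefficients
    print(coefs)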





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100710481
  
--- Diff: R/pkg/R/mllib_utils.R ---
@@ -35,7 +35,8 @@
 #' @seealso \link{spark.als}, \link{spark.bisectingKmeans}, 
\link{spark.gaussianMixture},
 #' @seealso \link{spark.gbt}, \link{spark.glm}, \link{glm}, 
\link{spark.isoreg},
 #' @seealso \link{spark.kmeans},
-#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.mlp}, 
\link{spark.naiveBayes},
+#' @seealso \link{spark.lda}, \link{spark.logit}, \link{spark.svmLinear},
--- End diff --

OK. I will sort it. It was previously named `linearSvc`, and I used the editor to 
replace the occurrences automatically. Thanks!





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-13 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100929519
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -39,6 +46,116 @@ setClass("MultilayerPerceptronClassificationModel", 
representation(jobj = "jobj"
 #' @note NaiveBayesModel since 2.0.0
 setClass("NaiveBayesModel", representation(jobj = "jobj"))
 
+#' linear SVM Model
+#'
+#' Fits an linear SVM model against a SparkDataFrame. It is a binary 
classifier, similar to svm in glmnet package
+#' Users can print, make predictions on the produced model and save the 
model to the input path.
+#'
+#' @param data SparkDataFrame for training.
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param regParam The regularization parameter.
+#' @param maxIter Maximum iteration number.
+#' @param tol Convergence tolerance of iterations.
+#' @param standardization Whether to standardize the training features 
before fitting the model. The coefficients
+#'of models will be always returned on the 
original scale, so it will be transparent for
+#'users. Note that with/without standardization, 
the models should be always converged
+#'to the same solution when no regularization is 
applied. Default is TRUE, same as glmnet.
--- End diff --

glmnet help message: "standardize: Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is ‘standardize=TRUE’."
I think they are the same.
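
For comparison, the corresponding glmnet call looks roughly like this (a small illustrative dataset; standardize defaults to TRUE):

    library(glmnet)

    # illustrative data only
    set.seed(1)
    x <- matrix(rnorm(100 * 4), ncol = 4)
    y <- rbinom(100, 1, 0.5)

    # standardize defaults to TRUE; coefficients come back on the original scale
    fit <- glmnet(x, y, family = "binomial", standardize = TRUE)
    coef(fit, s = 0.1)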





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-14 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100976153
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -39,6 +46,116 @@ setClass("MultilayerPerceptronClassificationModel", 
representation(jobj = "jobj"
 #' @note NaiveBayesModel since 2.0.0
 setClass("NaiveBayesModel", representation(jobj = "jobj"))
 
+#' linear SVM Model
+#'
+#' Fits an linear SVM model against a SparkDataFrame. It is a binary 
classifier, similar to svm in glmnet package
+#' Users can print, make predictions on the produced model and save the 
model to the input path.
+#'
+#' @param data SparkDataFrame for training.
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param regParam The regularization parameter.
+#' @param maxIter Maximum iteration number.
+#' @param tol Convergence tolerance of iterations.
+#' @param standardization Whether to standardize the training features 
before fitting the model. The coefficients
+#'of models will be always returned on the 
original scale, so it will be transparent for
+#'users. Note that with/without standardization, 
the models should be always converged
+#'to the same solution when no regularization is 
applied. Default is TRUE, same as glmnet.
--- End diff --

I got your point now. I removed the unnecessary documentation. Thanks!





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-14 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r100976207
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LinearSVCWrapper.scala 
---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.classification.{LinearSVC, LinearSVCModel}
+import org.apache.spark.ml.feature.{IndexToString, RFormula}
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class LinearSVCWrapper private (
+val pipeline: PipelineModel,
+val features: Array[String],
+val labels: Array[String]) extends MLWritable {
+  import LinearSVCWrapper._
+
+  private val svcModel: LinearSVCModel =
+pipeline.stages(1).asInstanceOf[LinearSVCModel]
+
+  lazy val coefficients: Array[Double] = svcModel.coefficients.toArray
--- End diff --

I have made the same change as 19395.





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-14 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r101176118
  
--- Diff: R/pkg/R/generics.R ---
@@ -1380,6 +1380,10 @@ setGeneric("spark.kstest", function(data, ...) { 
standardGeneric("spark.kstest")
 #' @export
 setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") 
})
 
+#' @rdname spark.svmLinear
+#' @export
+setGeneric("spark.svmLinear", function(data, formula, ...) { 
standardGeneric("spark.svmLinear") })
--- End diff --

Fixed. Thanks!





[GitHub] spark pull request #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-14 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16800#discussion_r101176046
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -39,6 +46,131 @@ setClass("MultilayerPerceptronClassificationModel", 
representation(jobj = "jobj"
 #' @note NaiveBayesModel since 2.0.0
 setClass("NaiveBayesModel", representation(jobj = "jobj"))
 
+#' linear SVM Model
+#'
+#' Fits an linear SVM model against a SparkDataFrame. It is a binary 
classifier, similar to svm in glmnet package
+#' Users can print, make predictions on the produced model and save the 
model to the input path.
+#'
+#' @param data SparkDataFrame for training.
+#' @param formula A symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#' @param regParam The regularization parameter.
+#' @param maxIter Maximum iteration number.
+#' @param tol Convergence tolerance of iterations.
+#' @param standardization Whether to standardize the training features 
before fitting the model. The coefficients
+#'of models will be always returned on the 
original scale, so it will be transparent for
+#'users. Note that with/without standardization, 
the models should be always converged
+#'to the same solution when no regularization is 
applied.
+#' @param threshold The threshold in binary classification, in range [0, 
1].
+#' @param weightCol The weight column name.
+#' @param aggregationDepth The depth for treeAggregate (greater than or 
equal to 2). If the dimensions of features
+#' or the number of partitions are large, this 
param could be adjusted to a larger size.
+#' This is an expert parameter. Default value 
should be good for most cases.
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.svmLinear} returns a fitted linear SVM model.
+#' @rdname spark.svmLinear
+#' @aliases spark.svmLinear,SparkDataFrame,formula-method
+#' @name spark.svmLinear
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' df <- createDataFrame(iris)
+#' training <- df[df$Species %in% c("versicolor", "virginica"), ]
+#' model <- spark.svmLinear(training, Species ~ ., regParam = 0.5)
+#' summary <- summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, training)
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and predict
+#' # Note that summary deos not work on loaded model
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.svmLinear since 2.2.0
+setMethod("spark.svmLinear", signature(data = "SparkDataFrame", formula = 
"formula"),
+  function(data, formula, regParam = 0.0, maxIter = 100, tol = 
1E-6, standardization = TRUE,
+   threshold = 0.5, weightCol = NULL, aggregationDepth = 
2) {
--- End diff --

Sorry for missing this one. I have fixed it now.





[GitHub] spark issue #12675: [SPARK-14894][PySpark] Add result summary api to Gaussia...

2017-02-14 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/12675
  
#15777 has resolved this issue. We should close this one.





[GitHub] spark issue #12675: [SPARK-14894][PySpark] Add result summary api to Gaussia...

2017-02-14 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/12675
  
@HyukjinKwon @srowen This should be closed. Thanks! 





[GitHub] spark issue #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16800
  
@felixcheung I will do the example and vignettes today. For the documentation, I 
will wait for @hhbyyh to merge his main documentation first. Thanks!





[GitHub] spark pull request #16761: [BackPort-2.1][SPARK-19319][SparkR]:SparkR Kmeans...

2017-02-15 Thread wangmiao1981
Github user wangmiao1981 closed the pull request at:

https://github.com/apache/spark/pull/16761





[GitHub] spark pull request #16945: [SPARK-19616][SparkR]:weightCol and aggregationDe...

2017-02-15 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16945

[SPARK-19616][SparkR]:weightCol and aggregationDepth should be improved for 
some SparkR APIs 

## What changes were proposed in this pull request?

This is a follow-up PR of #16800 

While working on SPARK-19456, we found that "" should be considered a NULL column 
name and should not be set. aggregationDepth should be exposed as an expert 
parameter.
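
The intended calling pattern after this change is roughly the following (a sketch using spark.svmLinear, whose signature from #16800 already exposes both parameters; the data and column names are illustrative):

    df <- createDataFrame(data.frame(label = c(0, 0, 1, 1),
                                     x = c(1, 2, 3, 4),
                                     w = c(1, 1, 2, 2)))

    # weightCol = NULL (the default) means "no weight column"; "" is no longer
    # passed down as a sentinel value
    m1 <- spark.svmLinear(df, label ~ x, aggregationDepth = 2)

    # a weighted fit names an actual column instead
    m2 <- spark.svmLinear(df, label ~ x, weightCol = "w")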

## How was this patch tested?
Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark svc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16945.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16945









[GitHub] spark issue #16945: [SPARK-19616][SparkR]:weightCol and aggregationDepth sho...

2017-02-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16945
  
cc @felixcheung 





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-02-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/13440
  
@erikerlandson Are you still working on this PR? Thanks! Miao





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-02-16 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/13440
  
@erikerlandson I am just helping to clear the stale PRs. :) I have no idea 
whether they intend to accept it.





[GitHub] spark issue #16945: [SPARK-19616][SparkR]:weightCol and aggregationDepth sho...

2017-02-16 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16945
  
I will add tests. Now I am looking for a dataset other than iris to use 
in the documentation.





[GitHub] spark pull request #16969: [SPARK-19639][SPARKR][Example]:Add spark.svmLinea...

2017-02-16 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16969

[SPARK-19639][SPARKR][Example]:Add spark.svmLinear example and update 
vignettes

## What changes were proposed in this pull request?

We recently added the spark.svmLinear API for SparkR. We need to add an 
example and update the vignettes.

## How was this patch tested?

Manually ran the example.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark example

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16969.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16969


commit 6e54a4d0c2f05cb017b3ce4ce105a723d03a9306
Author: wm...@hotmail.com 
Date:   2017-02-17T00:07:39Z

add example and vignnettes







[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-16 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@thunterdb Thanks for your review! I will address the comments soon.





[GitHub] spark issue #16945: [SPARK-19616][SparkR]:weightCol and aggregationDepth sho...

2017-02-16 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16945
  
I added a test of weightCol for spark.logit.
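
A rough sketch of what such a test can look like (illustrative data and assertions, not the exact test that was added; assumes an active SparkR session):

    library(SparkR)
    library(testthat)

    test_that("spark.logit with weightCol", {
      df <- createDataFrame(data.frame(label = c(0, 0, 0, 1, 1),
                                       feature = c(1, 2, 3, 4, 5),
                                       weight = c(1, 1, 1, 5, 5)))
      model <- spark.logit(df, label ~ feature, weightCol = "weight")
      prediction <- collect(select(predict(model, df), "prediction"))
      expect_equal(nrow(prediction), 5)
    })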





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-17 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101870770
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
--- End diff --

In the MLlib implementation, the clustering result is `case class 
Assignment(id: Long, cluster: Int)`. `idCol` is the node id and `cluster` 
is the id of the cluster that the node belongs to. I think `id` is still useful to 
represent the node. Otherwise, we would need to make sure that the output order of 
the nodes is the same as the input order, which means `id` would be implicitly 
inferred from the row number.





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-17 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r101871069
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the 
abstract: PIC finds a very
+ * low-dimensional embedding of a dataset using truncated power iteration 
on a normalized pair-wise
+ * similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral 
clustering (Wikipedia)]]
+ */
+@Since("2.2.0")
+@Experimental
+class PowerIterationClustering private[clustering] (
+ 

[GitHub] spark issue #16945: [SPARK-19616][SparkR]:weightCol and aggregationDepth sho...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16945
  
@felixcheung I have made the suggested changes. Thanks!





[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r102308292
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * [[http://www.icml2010.org/papers/387.pdf Lin and Cohen]]. From the 
abstract: PIC finds a very
+ * low-dimensional embedding of a dataset using truncated power iteration 
on a normalized pair-wise
+ * similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see [[http://en.wikipedia.org/wiki/Spectral_clustering Spectral 
clustering (Wikipedia)]]
+ */
+@Since("2.2.0")
+@Experimental
+class PowerIterationClustering private[clustering] (
+ 

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r102330772
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.{Vector}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.{col}
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = new Param[String](this, "initMode", "The 
initialization algorithm. " +
+"Supported options: 'random' and 'degree'.",
+(value: String) => validateInitMode(value))
+
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+initMode match {
+  case "random" => true
+  case "degree" => true
+  case _ => false
+}
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "idCol", "column name for ids.")
--- End diff --

PIC is different from K-Means. K-Means `transform` applies the `predict` 
method to each row of the input (i.e., each data point), while PIC `run` 
applies K-Means to the pseudo-eigenvector produced by the `powerIter` method, 
so there is no one-to-one map from the input dataset to the result. Please also 
see the comments below.
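
This is not the code in the PR, just a rough sketch to illustrate the point; the 
input column names ("src", "dst", "similarity") and the final join are assumptions 
made for the example. A Transformer-style `transform` has to run the full MLlib PIC 
job over the whole graph and then join the per-vertex assignments back onto the 
input by id, rather than mapping each row independently:

    import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPIC}
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Assumed input layout: one similarity edge per row as (src: Long, dst: Long, similarity: Double).
    def picTransform(spark: SparkSession, dataset: DataFrame, k: Int, maxIter: Int): DataFrame = {
      import spark.implicits._

      val similarities = dataset.select("src", "dst", "similarity").rdd
        .map(r => (r.getLong(0), r.getLong(1), r.getDouble(2)))

      // Clusters the whole graph at once; this is why transform() is an expensive operation.
      val model = new MLlibPIC().setK(k).setMaxIterations(maxIter).run(similarities)

      // One assignment per vertex id, not per input row: attach predictions by joining on the id.
      val assignments = model.assignments.toDF("id", "prediction")
      dataset.join(assignments, dataset("src") === assignments("id"), "left_outer")
    }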



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r102337526
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala
 ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import scala.collection.mutable
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+
+class PowerIterationClusteringSuite extends SparkFunSuite
+  with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  @transient var data: Dataset[_] = _
+  final val r1 = 1.0
+  final val n1 = 10
+  final val r2 = 4.0
+  final val n2 = 40
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+data = PowerIterationClusteringSuite.generatePICData(spark, r1, r2, 
n1, n2)
+  }
+
+  test("default parameters") {
+val pic = new PowerIterationClustering()
+
+assert(pic.getK === 2)
+assert(pic.getMaxIter === 20)
+assert(pic.getInitMode === "random")
+assert(pic.getFeaturesCol === "features")
+assert(pic.getPredictionCol === "prediction")
+assert(pic.getIdCol === "id")
+  }
+
+  test("set parameters") {
+val pic = new PowerIterationClustering()
+  .setK(9)
+  .setMaxIter(33)
+  .setInitMode("degree")
+  .setFeaturesCol("test_feature")
+  .setPredictionCol("test_prediction")
+  .setIdCol("test_id")
+
+assert(pic.getK === 9)
+assert(pic.getMaxIter === 33)
+assert(pic.getInitMode === "degree")
+assert(pic.getFeaturesCol === "test_feature")
+assert(pic.getPredictionCol === "test_prediction")
+assert(pic.getIdCol === "test_id")
+  }
+
+  test("parameters validation") {
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setK(1)
+}
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setInitMode("no_such_a_mode")
+}
+  }
+
+  test("power iteration clustering") {
--- End diff --

Add it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@thunterdb Thanks for your response. In the original JIRA, we have 
discussed why we want it to be a transformer. Let me find it and post it here. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
Joseph K. Bradley added a comment - 31/Oct/16 18:14

Miao Wang Sorry for the slow response here. I do want us to add PIC to 
spark.ml, but we should discuss the design before the PR. Could you please 
close the PR for now but save the branch to re-open after discussion?

Let's have a design discussion first.

I agree that the big issue is that there isn't a clear way to make 
predictions on new data points. In fact, I've never heard of people trying to 
do so. Has anyone else?

Assuming that prediction is not meaningful for PIC, then I don't think the 
algorithm fits within the Pipeline framework, though it's debatable. I see a 
few options:

1. Put PIC in Pipelines as a Transformer, not an Estimator. We would just 
need to document that it is a very expensive Transformer.
2. Put PIC in spark.ml as a static method. We may have to do this anyways 
to support all of spark.mllib's Statistics.
3. Put PIC in GraphFrames (and push harder for GraphFrames to be merged 
back into Spark, which will include a much longer set of improvements).

My top choice is PIC as a Transformer. What do you think?

CC Yanbo Liang Seth Hendrickson Nick Pentreath opinions?
sethah Seth Hendrickson added a comment - 31/Oct/16 22:40

This seems like it fits the framework of a feature transformer. We could 
generate a real-valued feature column using the PIC algorithm, where the values 
are just the components of the pseudo-eigenvector. Alternatively, we could 
pipeline a KMeans clustering on the end, but I think it makes more sense to let 
users do that themselves - that's up for debate, though.
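
A loose sketch of that idea (the "picComponent" column name and the stage wiring 
are purely illustrative, assuming some upstream transformer has already produced 
one pseudo-eigenvector component per row):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // Pack the single pseudo-eigenvector component into a vector column...
    val assembler = new VectorAssembler()
      .setInputCols(Array("picComponent"))
      .setOutputCol("features")

    // ...and optionally let users append a K-Means stage on top of it themselves.
    val kmeans = new KMeans()
      .setK(2)
      .setFeaturesCol("features")
      .setPredictionCol("cluster")

    val pipeline = new Pipeline().setStages(Array(assembler, kmeans))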



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
Yanbo Liang added a comment - 02/Nov/16 09:30 - edited

I prefer #1 and #3, but it looks like we can achieve both goals.
A graph can be represented by GraphX/GraphFrame or DataFrame/RDD. A PIC model 
can be trained on either, but we use GraphX operators in the internal 
implementation, which means the input data must be converted to the GraphX 
representation if it is an RDD of tuples. So it's straightforward to make PIC 
one of the algorithms in GraphX (or GraphFrame when it is merged back into 
Spark). However, users may load their graph as a DataFrame/RDD and transform it 
via an ML Pipeline, which should also be supported, so it's better if we can 
wrap the GraphX/GraphFrame PIC as a Pipeline stage so ML users can use it as well.
For historical reasons (we don't want to add new features to GraphX), I 
propose to split this task into the following steps:

1. Put PIC in the Pipeline as a Transformer, using the GraphX operators in the 
implementation (consistent with Joseph K. Bradley's proposal).
2. Add the PIC algorithm to GraphFrames when it is merged into Spark.
3. Make the ML PIC a wrapper that calls the GraphFrames PIC implementation.

I think this approach should work well for both kinds of users (ML users and 
GraphFrames users), but I'm still open to hearing your thoughts. Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/13440
  
@thunterdb Can you take a look? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
I am checking ALS out to understand your suggestions. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-22 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@thunterdb Per the discussion with Yanbo, there is one concern about making it an 
Estimator: every `transform` would incur an additional data shuffle. cc 
@yanboliang @jkbradley Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-22 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/17032

[SPARK-19460][SparkR]:Update dataset used in R documentation, examples to 
reduce warning noise and confusions

## What changes were proposed in this pull request?

Replace the `iris` dataset with `Titanic` or other datasets in the examples and 
documentation.

## How was this patch tested?

Manual and existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark example

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17032.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17032


commit 1f574673c59cb6fe1bda265a478c4cc526b03cdd
Author: wm...@hotmail.com 
Date:   2017-02-23T00:49:06Z

remove iris dataset in example and document




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17032: [SPARK-19460][SparkR]:Update dataset used in R documenta...

2017-02-22 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17032
  
cc @felixcheung 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102801880
  
--- Diff: examples/src/main/r/ml/bisectingKmeans.R ---
@@ -25,20 +25,21 @@ library(SparkR)
 sparkR.session(appName = "SparkR-ML-bisectingKmeans-example")
 
 # $example on$
-irisDF <- createDataFrame(iris)
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
 
 # Fit bisecting k-means model with four centers
-model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+model <- spark.bisectingKmeans(training, Class ~ Survived, k = 4)
 
 # get fitted result from a bisecting k-means model
 fitted.model <- fitted(model, "centers")
 
 # Model summary
-summary(fitted.model)
+head(summary(fitted.model))
--- End diff --

In this case, `summary` returns a SparkDataFrame, so calling it alone won't 
print out its contents; that's why it is wrapped in `head`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102801930
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -565,11 +565,10 @@ We use a simple example to demonstrate `spark.logit` 
usage. In general, there ar
 and 3). Obtain the coefficient matrix of the fitted model using `summary` 
and use the model for prediction with `predict`.
 
 Binomial logistic regression
-```{r, warning=FALSE}
-df <- createDataFrame(iris)
-# Create a DataFrame containing two classes
-training <- df[df$Species %in% c("versicolor", "virginica"), ]
-model <- spark.logit(training, Species ~ ., regParam = 0.00042)
+```{r}
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
+model <- spark.logit(training, Sex ~ ., regParam = 0.00042)
--- End diff --

I will check.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102801964
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -143,14 +143,15 @@ print.summary.treeEnsemble <- function(x) {
 #'
 #' # fit a Gradient Boosted Tree Classification Model
 #' # label must be binary - Only binary classification is supported for 
GBT.
-#' df <- createDataFrame(iris[iris$Species != "virginica", ])
-#' model <- spark.gbt(df, Species ~ Petal_Length + Petal_Width, 
"classification")
+#' t <- as.data.frame(Titanic)
+#' df <- createDataFrame(t)
+#' model <- spark.gbt(df, Sex ~ Age + Freq, "classification")
 #'
 #' # numeric label is also supported
-#' iris2 <- iris[iris$Species != "virginica", ]
-#' iris2$NumericSpecies <- ifelse(iris2$Species == "setosa", 0, 1)
-#' df <- createDataFrame(iris2)
-#' model <- spark.gbt(df, NumericSpecies ~ ., type = "classification")
+#' t2 <- as.data.frame(Titanic)
+#' t2$NumericGener <- ifelse(t2$Sex == "Male", 0, 1)
--- End diff --

`Gender`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102802291
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -143,14 +143,15 @@ print.summary.treeEnsemble <- function(x) {
 #'
 #' # fit a Gradient Boosted Tree Classification Model
 #' # label must be binary - Only binary classification is supported for 
GBT.
-#' df <- createDataFrame(iris[iris$Species != "virginica", ])
-#' model <- spark.gbt(df, Species ~ Petal_Length + Petal_Width, 
"classification")
+#' t <- as.data.frame(Titanic)
+#' df <- createDataFrame(t)
+#' model <- spark.gbt(df, Sex ~ Age + Freq, "classification")
--- End diff --

I just want to demonstrate with a categorical variable. I can change it to 
`Survived`. Is that OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102802725
  
--- Diff: examples/src/main/r/ml/glm.R ---
@@ -25,11 +25,12 @@ library(SparkR)
 sparkR.session(appName = "SparkR-ML-glm-example")
 
 # $example on$
-irisDF <- suppressWarnings(createDataFrame(iris))
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
 # Fit a generalized linear model of family "gaussian" with spark.glm
-gaussianDF <- irisDF
-gaussianTestDF <- irisDF
-gaussianGLM <- spark.glm(gaussianDF, Sepal_Length ~ Sepal_Width + Species, 
family = "gaussian")
+gaussianDF <- training
+gaussianTestDF <- training
--- End diff --

Yes, I agree. I saw other examples using the same dataset for testing. How 
about fixing them all in another follow-up PR, and keeping this PR focused on 
the `iris` dataset replacement? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102823636
  
--- Diff: examples/src/main/r/ml/glm.R ---
@@ -25,11 +25,12 @@ library(SparkR)
 sparkR.session(appName = "SparkR-ML-glm-example")
 
 # $example on$
-irisDF <- suppressWarnings(createDataFrame(iris))
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
 # Fit a generalized linear model of family "gaussian" with spark.glm
-gaussianDF <- irisDF
-gaussianTestDF <- irisDF
-gaussianGLM <- spark.glm(gaussianDF, Sepal_Length ~ Sepal_Width + Species, 
family = "gaussian")
+gaussianDF <- training
+gaussianTestDF <- training
--- End diff --

Since this is an example rather than the vignettes, we can use the datasets in 
the data/mllib directory.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102823739
  
--- Diff: examples/src/main/r/ml/bisectingKmeans.R ---
@@ -25,20 +25,21 @@ library(SparkR)
 sparkR.session(appName = "SparkR-ML-bisectingKmeans-example")
 
 # $example on$
-irisDF <- createDataFrame(iris)
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
 
 # Fit bisecting k-means model with four centers
-model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+model <- spark.bisectingKmeans(training, Class ~ Survived, k = 4)
 
 # get fitted result from a bisecting k-means model
 fitted.model <- fitted(model, "centers")
 
 # Model summary
-summary(fitted.model)
+head(summary(fitted.model))
--- End diff --

Will do in follow-up PR. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102856602
  
--- Diff: examples/src/main/r/ml/kmeans.R ---
@@ -26,10 +26,12 @@ sparkR.session(appName = "SparkR-ML-kmeans-example")
 
 # $example on$
 # Fit a k-means model with spark.kmeans
-irisDF <- suppressWarnings(createDataFrame(iris))
-kmeansDF <- irisDF
-kmeansTestDF <- irisDF
-kmeansModel <- spark.kmeans(kmeansDF, ~ Sepal_Length + Sepal_Width + 
Petal_Length + Petal_Width,
+t <- as.data.frame(Titanic)
--- End diff --

kmeans_data.txt and sample_kmeans_data.txt have fewer data points than 
Titanic. So in this case, I am still using the Titanic dataset.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@thunterdb @yanboliang Have we reached an agreement on whether to make it a 
transformer or an estimator now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-23 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r102857809
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -565,11 +565,10 @@ We use a simple example to demonstrate `spark.logit` 
usage. In general, there ar
 and 3). Obtain the coefficient matrix of the fitted model using `summary` 
and use the model for prediction with `predict`.
 
 Binomial logistic regression
-```{r, warning=FALSE}
-df <- createDataFrame(iris)
-# Create a DataFrame containing two classes
-training <- df[df$Species %in% c("versicolor", "virginica"), ]
-model <- spark.logit(training, Species ~ ., regParam = 0.00042)
+```{r}
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
+model <- spark.logit(training, Sex ~ ., regParam = 0.05805813)
--- End diff --

Tested with cv.glmnet.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17032: [SPARK-19460][SparkR]:Update dataset used in R do...

2017-02-24 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17032#discussion_r103004945
  
--- Diff: examples/src/main/r/ml/glm.R ---
@@ -25,12 +25,12 @@ library(SparkR)
 sparkR.session(appName = "SparkR-ML-glm-example")
 
 # $example on$
-t <- as.data.frame(Titanic)
-training <- createDataFrame(t)
+training <- 
read.df("data/mllib/sample_multiclass_classification_data.txt", source = 
"libsvm")
 # Fit a generalized linear model of family "gaussian" with spark.glm
-gaussianDF <- training
-gaussianTestDF <- training
-gaussianGLM <- spark.glm(gaussianDF, Freq ~ Sex + Age, family = "gaussian")
+set.seed(2)
+gaussianDF <- sample(training, TRUE, 0.7)
+gaussianTestDF <- sample(training, TRUE, 0.3)
--- End diff --

OK. I will change it. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-02-26 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@jkbradley Thanks for your reply! I quickly went through your suggestions. If 
I understand correctly, you prefer making it a `Transformer`, as we previously 
discussed, but changing the input data to fit into the pipeline. Right? Let 
me think about the details and evaluate each option before making the changes. 
Thanks! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17032: [SPARK-19460][SparkR]:Update dataset used in R documenta...

2017-02-26 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17032
  
@felixcheung I have made the changes per our review discussion. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16393: [SPARK-18993] [Build] Revert Split test-tags into...

2016-12-26 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16393#discussion_r93900810
  
--- Diff: common/tags/pom.xml ---
@@ -34,6 +34,14 @@
 tags
   
 
+  
+
+  org.scalatest
+  scalatest_${scala.binary.version}
+  compile
+
+  
--- End diff --

I hit the same issue after fetching the latest master. I did clean and 
rebuild. It still has the issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-03 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16464

[SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer correctly

## What changes were proposed in this pull request?

spark.lda passes the optimizer "em" or "online" as a string to the backend. 
However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, 
for optimizer "em", the `isDistributed` field is FALSE when it should be TRUE 
according to the Scala code.

In addition, the `summary` method should also return the results related to 
`DistributedLDAModel`.

## How was this patch tested?
Manual tests by comparing with the Scala example.
Modified the current unit tests: fixed the incorrect unit test and added the 
necessary tests for the `summary` method.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark new

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16464.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16464


commit 277763614934aef2625274e2e55e7efe6eb4c2df
Author: wm...@hotmail.com 
Date:   2017-01-03T23:26:28Z

fix the optimizer bug




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-04 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
cc @felixcheung @yanboliang A bug fix. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-04 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94721880
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends 
MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

In the first version, I got them from the model in the `LDAWrapper` class. 
However, when I read `logPrior`, I found that the loaded `logPrior` is not the 
same as the value before saving. So I followed the approach used for 
logLikelihood and logPerplexity and saved it in the metadata. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94866311
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends 
MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

With the same dataset, Scala side tests:
Original logPrior: -3.3387459952856338
logPrior from saved model: -0.9202435107654922



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94983934
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -123,6 +126,10 @@ private[r] object LDAWrapper extends 
MLReadable[LDAWrapper] {
   .setMaxIter(maxIter)
   .setSubsamplingRate(subsamplingRate)
 
+if (optimizer == "em") {
--- End diff --

If it is "online", it is not necessary to set the optimizer. But setting it 
anyway will make the code clean. I will do it.
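
For reference, a minimal sketch of what setting it unconditionally could look like 
(the values below are placeholders standing in for the arguments passed from the R 
side; this is not the actual wrapper code):

    import org.apache.spark.ml.clustering.LDA

    // Placeholder values standing in for the arguments passed from spark.lda on the R side.
    val k = 3
    val maxIter = 20
    val subsamplingRate = 0.05
    val optimizer = "em"   // or "online"

    // Setting the optimizer unconditionally keeps the wrapper uniform:
    // "em" yields a DistributedLDAModel, "online" a LocalLDAModel.
    val lda = new LDA()
      .setK(k)
      .setMaxIter(maxIter)
      .setSubsamplingRate(subsamplingRate)
      .setOptimizer(optimizer)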


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r94984564
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -172,6 +187,8 @@ private[r] object LDAWrapper extends 
MLReadable[LDAWrapper] {
   model,
   ldaModel.logLikelihood(preprocessedData),
   ldaModel.logPerplexity(preprocessedData),
+  trainingLogLikelihood,
+  logPrior,
--- End diff --

logPrior is calculated from the serialized topics etc., which are also used by 
trainingLogLikelihood. But trainingLogLikelihood is the same for both the 
original and the loaded model. Let me debug more; it looks like a bug. The 
original MLlib implementation doesn't serialize the two parameters, as they can 
be calculated from other saved values. In addition, there is no unit test 
comparing the two values, which could be why this issue wasn't caught. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
When calculating `logPrior`, `graph.vertices.aggregate(0.0)(seqOp, _ + _)` 
gets different values for the two models, even though all parameters of the 
model are the same. 

I printed out each step of `seqOp`, and the steps differ between the two models. 
I am trying to understand the logic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
In `graph.vertices.aggregate(0.0)(seqOp, _ + _)`, the vertex sequence differs 
between the two models.
VertexID sequence for the original model:
-8 -6 -3 10 -5 4 11 -1 0 1 -2 6 -7 7 8 -10 -9 9 -4 3 -11 5 2
VertexID sequence for the loaded model:
3 5 2 -11 -5 4 11 -1 0 1 -2 6 -7 -8 7 -10 8 9 -9 -4 10 -6 -3


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
I printed out the values at each step of `seqOp`; 
`graph.vertices.aggregate(0.0)(seqOp, _ + _)` just returns the `seqOp` value of 
the last vertex.
VertexID sequence for the original model:
-8 -6 -3 10 -5 4 11 -1 0 1 -2 6 -7 7 8 -10 -9 9 -4 3 -11 5 2
For this sequence, it returns the value of vertex 2.
VertexID sequence for the loaded model:
3 5 2 -11 -5 4 11 -1 0 1 -2 6 -7 -8 7 -10 8 9 -9 -4 10 -6 -3
For this sequence, it returns the value of vertex -3.

Actually, the value of vertex 2 is the same in both sequences (same for 
vertex -3 too).

It seems that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` doesn't 
return the aggregated value it is supposed to return (i.e., 
`sum(seqOp(Vertex_i))`). 
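
A tiny self-contained sketch of the suspected behavior (plain RDD.aggregate with 
made-up numbers, not the actual graph code): if seqOp folds each element into the 
running accumulator, the result is the true sum; if seqOp drops the accumulator and 
only keeps the last element seen, the result depends on the partitioning and 
traversal order, which matches the symptom above.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[2]").appName("aggregate-demo").getOrCreate()
    val values = spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0, 4.0), numSlices = 2)

    // Correct usage: seqOp folds each element into the running accumulator,
    // combOp merges per-partition results -> always 10.0.
    val correct = values.aggregate(0.0)((acc, v) => acc + v, _ + _)

    // Buggy variant: seqOp ignores the accumulator, so only the last element
    // seen in each partition survives, and the result depends on element order.
    val buggy = values.aggregate(0.0)((_, v) => v, _ + _)

    println(s"correct = $correct, buggy = $buggy")
    spark.stop()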



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
I think it is a bug. I will file a PR to fix it. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16491: [SPARK-19110][ML][MLLIB]:DistributedLDAModel retu...

2017-01-06 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16491

[SPARK-19110][ML][MLLIB]:DistributedLDAModel returns different logPrior for 
original and loaded model

## What changes were proposed in this pull request?

While adding the DistributedLDAModel training summary for SparkR, I found that 
the logPrior differs between the original and the loaded model.
For example, in test("read/write DistributedLDAModel"), I added the test:

    val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
    val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
    assert(logPrior === logPrior2)

The test fails:

    -4.394180878889078 did not equal -4.294290536919573

The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only 
returns the value of a single vertex instead of the aggregation over all 
vertices. Therefore, when the loaded model does the aggregation in a different 
order, it returns a different `logPrior`.

Please refer to #16464 for details.
## How was this patch tested?
Add a new unit test for testing logPrior.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark ldabug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16491.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16491


commit 31d9bce7c1de016fa23f1783e41aa8b394a68cc4
Author: wm...@hotmail.com 
Date:   2017-01-06T20:40:54Z

fix the bug of logPrior




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16491: [SPARK-19110][ML][MLLIB]:DistributedLDAModel returns dif...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16491
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
@felixcheung PR #16491 is filed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
Sure. I will revert it to the previous commit once #16491 is in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16491: [SPARK-19110][ML][MLLIB]:DistributedLDAModel retu...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16491#discussion_r95036829
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -260,6 +260,14 @@ class LDASuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultRead
 Vectors.dense(model2.topicsMatrix.toArray) absTol 1e-6)
   assert(Vectors.dense(model.getDocConcentration) ~==
 Vectors.dense(model2.getDocConcentration) absTol 1e-6)
+  val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
+  val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
+  val trainingLogLikelihood =
+model.asInstanceOf[DistributedLDAModel].trainingLogLikelihood
+  val trainingLogLikelihood2 =
+model2.asInstanceOf[DistributedLDAModel].trainingLogLikelihood
+  assert(logPrior ~== logPrior2 absTol 1e-6)
+  assert(trainingLogLikelihood ~== trainingLogLikelihood2 absTol 1e-6)
--- End diff --

`logLikelihood` and `logPrior` are only available for the distributed model. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16491: [SPARK-19110][ML][MLLIB]:DistributedLDAModel retu...

2017-01-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16491#discussion_r95045854
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -260,6 +260,14 @@ class LDASuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultRead
 Vectors.dense(model2.topicsMatrix.toArray) absTol 1e-6)
   assert(Vectors.dense(model.getDocConcentration) ~==
 Vectors.dense(model2.getDocConcentration) absTol 1e-6)
+  val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
+  val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
+  val trainingLogLikelihood =
+model.asInstanceOf[DistributedLDAModel].trainingLogLikelihood
+  val trainingLogLikelihood2 =
+model2.asInstanceOf[DistributedLDAModel].trainingLogLikelihood
+  assert(logPrior ~== logPrior2 absTol 1e-6)
+  assert(trainingLogLikelihood ~== trainingLogLikelihood2 absTol 1e-6)
--- End diff --

`LocalLDAModel` doesn't extend `DistributedLDAModel` and vice versa. I am 
not clear on how to check `trainingLogLikelihood` and `logPrior` for a 
`LocalLDAModel`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
@felixcheung I made the modifications and no longer save the two metrics of 
the distributed model.

Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16523: [SPARK-19142][SparkR]:spark.kmeans should take se...

2017-01-09 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16523

[SPARK-19142][SparkR]:spark.kmeans should take seed, initSteps, and tol as 
parameters

## What changes were proposed in this pull request?
spark.kmeans doesn't have an interface to set initSteps, seed, and tol. Since 
Spark's KMeans algorithm doesn't take the same set of parameters as R's kmeans, 
we should maintain a different interface in spark.kmeans.

Add the missing parameters and the corresponding documentation.

Modified existing unit tests to take additional parameters.
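
For context, these new R arguments map onto setters that already exist on the 
Scala-side estimator; an illustrative sketch (the values are arbitrary):

    import org.apache.spark.ml.clustering.KMeans

    val kmeans = new KMeans()
      .setK(2)
      .setMaxIter(10)
      .setInitMode("random")
      .setSeed(1L)       // exposed to R as `seed`
      .setInitSteps(3)   // exposed to R as `initSteps` (used by the k-means|| init)
      .setTol(1e-5)      // exposed to R as `tol`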

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark kmeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16523


commit 8379018ad2c40818da2c6d4bd950ab62dc0eedf4
Author: wm...@hotmail.com 
Date:   2017-01-09T23:50:36Z

take additional parameters for spark.kmeans




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16524: [SPARK-19110][MLLIB][FollowUP]: Add a unit test

2017-01-09 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16524

[SPARK-19110][MLLIB][FollowUP]: Add a unit test

## What changes were proposed in this pull request?
#16491 added the fix to mllib and a unit test to ml. This follow-up PR adds 
unit tests to the mllib suite.

## How was this patch tested?
Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark ldabug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16524.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16524


commit 37e4e6a7e4aa8381ea51589f4e27bde3f24defdd
Author: wm...@hotmail.com 
Date:   2017-01-10T00:10:32Z

add a test




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16523: [SPARK-19142][SparkR]:spark.kmeans should take seed, ini...

2017-01-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16523
  
cc @yanboliang 





[GitHub] spark pull request #16523: [SPARK-19142][SparkR]:spark.kmeans should take se...

2017-01-10 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16523#discussion_r95429832
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -204,11 +208,16 @@ setMethod("write.ml", signature(object = 
"GaussianMixtureModel", path = "charact
 #' @note spark.kmeans since 2.0.0
 #' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
 setMethod("spark.kmeans", signature(data = "SparkDataFrame", formula = 
"formula"),
-  function(data, formula, k = 2, maxIter = 20, initMode = 
c("k-means||", "random")) {
+  function(data, formula, k = 2, maxIter = 20, initMode = 
c("k-means||", "random"),
+   seed = NULL, initSteps = 2, tol = 1E-4) {
 formula <- paste(deparse(formula), collapse = "")
 initMode <- match.arg(initMode)
+if (!is.null(seed)) {
+  seed <- as.character(as.integer(seed))
--- End diff --

I followed an existing example in the file. Let me investigate this issue 
further.





[GitHub] spark pull request #16523: [SPARK-19142][SparkR]:spark.kmeans should take se...

2017-01-10 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16523#discussion_r95429974
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib_clustering.R ---
@@ -99,7 +99,8 @@ test_that("spark.kmeans", {
 
   take(training, 1)
 
-  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, 
initMode = "random")
+  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, 
initMode = "random", seed = 1,
+initSteps = 3, tol = 1E-5)
--- End diff --

Sounds good. I will compose a test whose outcome is actually controlled by these 
parameters.
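
For example, a parameter-sensitive check could look roughly like the sketch below 
(illustrative only, not the final test; it assumes the usual testthat setup used in 
test_mllib_clustering.R and an active SparkR session):

    test_that("spark.kmeans seed is honored", {
      df <- createDataFrame(iris)
      # Two fits with the same seed and random init should give identical centers,
      # so the assertion actually exercises the seed parameter.
      m1 <- spark.kmeans(data = df, ~ Sepal_Length + Sepal_Width, k = 3,
                         initMode = "random", seed = 1)
      m2 <- spark.kmeans(data = df, ~ Sepal_Length + Sepal_Width, k = 3,
                         initMode = "random", seed = 1)
      expect_equal(summary(m1)$coefficients, summary(m2)$coefficients)
    })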





[GitHub] spark issue #16524: [SPARK-19110][MLLIB][FollowUP]: Add a unit test for test...

2017-01-10 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16524
  
@srowen Title is updated. Unit tests for ML were added in the original fix, but 
the MLLIB case was not covered. This follow-up adds a simple unit test for the 
two parameters of DistributedLDAModel.

Thanks!





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-10 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r95436115
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
@@ -45,6 +45,11 @@ private[r] class LDAWrapper private (
   import LDAWrapper._
 
   private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
+  private val distributedMoel = lda.isDistributed match {
--- End diff --

Fixed. Thanks!





[GitHub] spark pull request #16523: [SPARK-19142][SparkR]:spark.kmeans should take se...

2017-01-11 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16523#discussion_r95640984
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib_clustering.R ---
@@ -99,7 +99,8 @@ test_that("spark.kmeans", {
 
   take(training, 1)
 
-  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, 
initMode = "random")
+  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, 
initMode = "random", seed = 1,
+initSteps = 3, tol = 1E-5)
--- End diff --

I ran the existing unit tests in KMeansSuite.scala in both ML and MLLIB. 
They are all insensitive to `seed` and `tol`. 





[GitHub] spark issue #16523: [SPARK-19142][SparkR]:spark.kmeans should take seed, ini...

2017-01-12 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16523
  
@felixcheung I will review these items after wrapping up my current work. Right 
now I am working on two items: the bug 18011 and the bisecting k-means wrapper.

The bisecting k-means wrapper should be ready soon; bug 18011 needs more 
debugging. Thanks!





[GitHub] spark pull request #16566: [SparkR]: add bisecting kmeans R wrapper

2017-01-12 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/16566

[SparkR]: add bisecting kmeans R wrapper

## What changes were proposed in this pull request?

Add an R wrapper for bisecting k-means.

As JIRA is down, I will update the title to link to the corresponding JIRA later.

## How was this patch tested?

Add new unit tests.
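
For reference, usage will mirror the other clustering wrappers; roughly (a sketch 
following the roxygen example in the patch, assuming an active SparkR session):

    df <- createDataFrame(iris)
    model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
    summary(model)
    # fitted values on the training data
    head(select(predict(model, df), "Sepal_Length", "prediction"))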

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark bk

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16566.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16566


commit 4f88cce8fc7aa880d886f93582f6245775953fad
Author: wm...@hotmail.com 
Date:   2017-01-13T00:36:47Z

add bisecting kmeans R wrapper







[GitHub] spark issue #16566: [SparkR]: add bisecting kmeans R wrapper

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16566
  
Jenkins, retest this please.





[GitHub] spark issue #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set optimizer c...

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/16464
  
Jenkins, retest this please.





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96047751
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", 
maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, 
"trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

When returning NULL,
`stats
$docConcentration
 [1] 0.09117984 0.09117993 0.09117964 0.09117969 0.09117991 0.13010149
 [7] 0.09117986 0.09117983 0.09192744 0.09117969

$topicConcentration
[1] 0.1

$logLikelihood
[1] -395.4505

$logPerplexity
[1] 2.995837

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array, termWeights:array]

$vocabulary
 [1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
NULL

$logPrior
NULL`





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96049008
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", 
maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, 
"trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

When returning NA
`> stats
$docConcentration
 [1] 0.09116572 0.13013764 0.09116557 0.09116559 0.09116584 0.09116617
 [7] 0.09116576 0.09116573 0.09116569 0.09116564

$topicConcentration
[1] 0.1

$logLikelihood
[1] -392.9573

$logPerplexity
[1] 2.976949

$isDistributed
[1] FALSE

$vocabSize
[1] 10

$topics
SparkDataFrame[topic:int, term:array, termWeights:array]

$vocabulary
 [1] "0" "1" "2" "3" "4" "9" "5" "8" "7" "6"

$trainingLogLikelihood
[1] NA

$logPrior
[1] NA
`





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-13 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96049147
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", 
maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, 
"trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

They look very similar. I am fine with either of the two. Thanks!





[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...

2017-01-14 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16464#discussion_r96125283
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -404,11 +411,14 @@ setMethod("summary", signature(object = "LDAModel"),
 vocabSize <- callJMethod(jobj, "vocabSize")
 topics <- dataFrame(callJMethod(jobj, "topics", 
maxTermsPerTopic))
 vocabulary <- callJMethod(jobj, "vocabulary")
+trainingLogLikelihood <- callJMethod(jobj, 
"trainingLogLikelihood")
+logPrior <- callJMethod(jobj, "logPrior")
--- End diff --

@felixcheung I will update it to NA. Thanks!
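
Concretely, the fallback I have in mind looks roughly like the sketch below (inside 
the LDA summary method, where `jobj` is the model's Java object reference); the guard 
could equally live on the Scala side in LDAWrapper, as the earlier diff suggests, so 
treat this only as an illustration of the intended behavior:

    # Only query the distributed-only metrics when the model is distributed;
    # otherwise report NA so the summary list always carries both entries.
    isDistributed <- callJMethod(jobj, "isDistributed")
    if (isDistributed) {
      trainingLogLikelihood <- callJMethod(jobj, "trainingLogLikelihood")
      logPrior <- callJMethod(jobj, "logPrior")
    } else {
      trainingLogLikelihood <- NA
      logPrior <- NA
    }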





[GitHub] spark pull request #16566: [SPARK-18821][SparkR]: Bisecting k-means wrapper ...

2017-01-14 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16566#discussion_r96125393
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -38,6 +45,146 @@ setClass("KMeansModel", representation(jobj = "jobj"))
 #' @note LDAModel since 2.1.0
 setClass("LDAModel", representation(jobj = "jobj"))
 
+#' Bisecting K-Means Clustering Model
+#'
+#' Fits a bisecting k-means clustering model against a Spark DataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, 
\code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to 
save/load fitted models.
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.bisectingKmeans.
+#' @param k the desired number of leaf clusters. Must be > 1.
+#'  The actual number could be smaller if there are no divisible 
leaf clusters.
+#' @param maxIter maximum iteration number.
+#' @param minDivisibleClusterSize The minimum number of points (if greater 
than or equal to 1.0)
+#'or the minimum proportion of points (if 
less than 1.0) of a divisible cluster.
+#' @param seed the random seed.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.bisectingKmeans} returns a fitted bisecting k-means 
model.
+#' @rdname spark.bisectingKmeans
+#' @aliases spark.bisectingKmeans,SparkDataFrame,formula-method
+#' @name spark.bisectingKmeans
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' data(iris)
+#' df <- createDataFrame(iris)
+#' model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "Sepal_Length", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.bisectingKmeans since 2.2.0
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.bisectingKmeans", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 4, maxIter = 20, 
minDivisibleClusterSize = 1.0, seed = NULL) {
--- End diff --

I will address the comments soon; I am still debugging right now. Thanks!





[GitHub] spark pull request #16566: [SPARK-18821][SparkR]: Bisecting k-means wrapper ...

2017-01-15 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16566#discussion_r96165457
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -38,6 +45,146 @@ setClass("KMeansModel", representation(jobj = "jobj"))
 #' @note LDAModel since 2.1.0
 setClass("LDAModel", representation(jobj = "jobj"))
 
+#' Bisecting K-Means Clustering Model
+#'
+#' Fits a bisecting k-means clustering model against a Spark DataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, 
\code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to 
save/load fitted models.
+#'
+#' @param data a SparkDataFrame for training.
+#' @param formula a symbolic description of the model to be fitted. 
Currently only a few formula
+#'operators are supported, including '~', '.', ':', '+', 
and '-'.
+#'Note that the response variable of formula is empty in 
spark.bisectingKmeans.
+#' @param k the desired number of leaf clusters. Must be > 1.
+#'  The actual number could be smaller if there are no divisible 
leaf clusters.
+#' @param maxIter maximum iteration number.
+#' @param minDivisibleClusterSize The minimum number of points (if greater 
than or equal to 1.0)
+#'or the minimum proportion of points (if 
less than 1.0) of a divisible cluster.
+#' @param seed the random seed.
+#' @param ... additional argument(s) passed to the method.
+#' @return \code{spark.bisectingKmeans} returns a fitted bisecting k-means 
model.
+#' @rdname spark.bisectingKmeans
+#' @aliases spark.bisectingKmeans,SparkDataFrame,formula-method
+#' @name spark.bisectingKmeans
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' data(iris)
+#' df <- createDataFrame(iris)
+#' model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+#' summary(model)
+#'
+#' # fitted values on training data
+#' fitted <- predict(model, df)
+#' head(select(fitted, "Sepal_Length", "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.bisectingKmeans since 2.2.0
+#' @seealso \link{predict}, \link{read.ml}, \link{write.ml}
+setMethod("spark.bisectingKmeans", signature(data = "SparkDataFrame", 
formula = "formula"),
+  function(data, formula, k = 4, maxIter = 20, 
minDivisibleClusterSize = 1.0, seed = NULL) {
+formula <- paste0(deparse(formula), collapse = "")
+if (!is.null(seed)) {
+  seed <- as.character(as.integer(seed))
+}
+jobj <- 
callJStatic("org.apache.spark.ml.r.BisectingKMeansWrapper", "fit",
+data@sdf, formula, as.integer(k), 
as.integer(maxIter),
+as.numeric(minDivisibleClusterSize), seed)
+new("BisectingKMeansModel", jobj = jobj)
+  })
+
+#  Get the summary of a bisecting k-means model
+
+#' @param object a fitted bisecting k-means model.
+#' @return \code{summary} returns summary information of the fitted model, 
which is a list.
+#' The list includes the model's \code{k} (number of cluster 
centers),
+#' \code{coefficients} (model cluster centers),
+#' \code{size} (number of data points in each cluster), and 
\code{cluster}
+#' (cluster centers of the transformed data).
+#' @rdname spark.bisectingKmeans
+#' @export
+#' @note summary(BisectingKMeansModel) since 2.2.0
+setMethod("summary", signature(object = "BisectingKMeansModel"),
+  function(object) {
+jobj <- object@jobj
+is.loaded <- callJMethod(jobj, "isLoaded")
+features <- callJMethod(jobj, "features")
+coefficients <- callJMethod(jobj, "coefficients")
+k <- callJMethod(jobj, "k")
+size <- callJMethod(jobj, "size")
+coefficients <- t(matrix(coefficients, ncol = k))
+colna

[GitHub] spark pull request #15365: [SPARK-17157][SPARKR]: Add multiclass logistic re...

2016-10-05 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/15365

[SPARK-17157][SPARKR]: Add multiclass logistic regression SparkR Wrapper

## What changes were proposed in this pull request?

As we discussed in #14818, I added a separate R wrapper spark.logit for 
logistic regression. 

This single interface supports both binary and multinomial logistic 
regression. It also provides "predict" and "summary" for binary logistic regression. 

## How was this patch tested?

New unit tests are added.
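
A minimal usage sketch (illustrative only; it restricts iris to two classes so that 
the binary summary path described above applies, and assumes an active SparkR session):

    training <- iris[iris$Species %in% c("versicolor", "virginica"), ]
    training$Species <- as.character(training$Species)
    df <- createDataFrame(training)

    # Fit a binary logistic regression, inspect it, and predict on the training data.
    model <- spark.logit(df, Species ~ Sepal_Length + Sepal_Width)
    summary(model)
    head(select(predict(model, df), "Species", "prediction"))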

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark glm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15365


commit 7f3d40d090604e1edeadfe7e747869290be61c2f
Author: wm...@hotmail.com 
Date:   2016-10-05T00:23:18Z

add spark.logit

commit 5bfd132791bed933b10dc3af8f0857d6be99
Author: wm...@hotmail.com 
Date:   2016-10-05T23:36:54Z

add unit tests







[GitHub] spark pull request #14818: [SPARK-17157][SPARKR][WIP]: Add multiclass logist...

2016-10-05 Thread wangmiao1981
Github user wangmiao1981 closed the pull request at:

https://github.com/apache/spark/pull/14818





[GitHub] spark issue #14818: [SPARK-17157][SPARKR][WIP]: Add multiclass logistic regr...

2016-10-05 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/14818
  
@felixcheung I opened #15365 based on our discussion in this PR. I am closing 
this PR now; please review #15365. Thanks!





[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...

2016-10-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15365
  
@felixcheung When running check-cran, there are errors:
 Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE
Error : requireNamespace("e1071", quietly = TRUE) is not TRUE

I am trying to figure out what the problem is. Any hints?

Thanks!






[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...

2016-10-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15365
  
From Jenkins, I saw the error:
WARNING: There was 1 warning.
NOTE: There were 3 notes.
See
  
'/home/jenkins/workspace/SparkPullRequestBuilder/R/SparkR.Rcheck/00check.log'
for details.

Had CRAN check errors; see logs.

How can I access the above file?





[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...

2016-10-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15365
  
I see the following error message in a local test run:

LaTeX errors when creating PDF version.
This typically indicates Rd problems.
* checking PDF version of manual without hyperrefs or index ... ERROR
Re-running with no redirection of stdout/stderr.
Hmm ... looks like a package
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  pdflatex is not available
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  pdflatex is not available
Error in running tools::texi2pdf()
You may want to clean up by 'rm -rf 
/var/folders/s_/83b0sgvj2kl2kwq4stvft_pmgn/T//RtmpXHJrOk/Rd2pdfac8961d3ab54'
* DONE







[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...

2016-10-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15365
  
@vectorijk Thanks for the information! I installed e1071 and the TeX package. 
I just wanted to find out what was causing the error.
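
For anyone hitting the same failure, the local fix was simply installing the 
missing pieces (sketch; e1071 comes from CRAN, while pdflatex comes from a TeX 
distribution rather than from CRAN):

    # check-cran exercises code paths that require the suggested e1071 package.
    install.packages("e1071")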






[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...

2016-10-06 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15365
  
My local tests passed.

Jenkins, retest this please.




