[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r182829257

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
    @@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
         assert(mostImportantFeature !== mostIF)
       }
    +  test("model evaluateEachIteration") {
    +    for (lossType <- Seq("logistic")) {
    --- End diff --

OK. It makes sense.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r182603253

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
    @@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
         assert(mostImportantFeature !== mostIF)
       }
    +  test("model evaluateEachIteration") {
    +    for (lossType <- Seq("logistic")) {
    --- End diff --

There is only one lossType, so the `for` is not necessary.
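To illustrate the point of this comment: a `for` over a one-element `Seq` runs its body exactly once, so binding the value directly gives the same result. The following is a minimal plain-Scala sketch without Spark; `evaluate` is a hypothetical stand-in for the test body and is not part of the PR.

```scala
object SingleLossTypeSketch {
  // Hypothetical stand-in for the body of the test; not code from the PR.
  def evaluate(lossType: String): String = s"evaluated with lossType=$lossType"

  // Form used in the diff: a loop over a one-element Seq.
  def withLoop(): Seq[String] =
    for (lossType <- Seq("logistic")) yield evaluate(lossType)

  // Form suggested in the review: bind the single value directly.
  def withoutLoop(): String = {
    val lossType = "logistic"
    evaluate(lossType)
  }

  def main(args: Array[String]): Unit = {
    // Both forms evaluate the body once, with the same argument.
    assert(withLoop() == Seq(withoutLoop()))
    println(withoutLoop())
  }
}
```

The loop form only makes a difference if further loss types are added to the `Seq` later.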
[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/21090#discussion_r182254888

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
    @@ -0,0 +1,256 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.clustering
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types._
    +
    +/**
    + * Common params for PowerIterationClustering
    + */
    +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter
    +  with HasPredictionCol {
    +
    +  /**
    +   * The number of clusters to create (k). Must be > 1. Default: 2.
    +   * @group param
    +   */
    +  @Since("2.4.0")
    +  final val k = new IntParam(this, "k", "The number of clusters to create. " +
    +    "Must be > 1.", ParamValidators.gt(1))
    +
    +  /** @group getParam */
    +  @Since("2.4.0")
    +  def getK: Int = $(k)
    +
    +  /**
    +   * Param for the initialization algorithm. This can be either "random" to use a random vector
    +   * as vertex properties, or "degree" to use a normalized sum of similarities with other vertices.
    +   * Default: random.
    +   * @group expertParam
    +   */
    +  @Since("2.4.0")
    +  final val initMode = {
    +    val allowedParams = ParamValidators.inArray(Array("random", "degree"))
    +    new Param[String](this, "initMode", "The initialization algorithm. This can be either " +
    +      "'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum " +
    +      "of similarities with other vertices. Supported options: 'random' and 'degree'.",
    +      allowedParams)
    +  }
    +
    +  /** @group expertGetParam */
    +  @Since("2.4.0")
    +  def getInitMode: String = $(initMode)
    +
    +  /**
    +   * Param for the name of the input column for vertex IDs.
    +   * Default: "id"
    +   * @group param
    +   */
    +  @Since("2.4.0")
    +  val idCol = new Param[String](this, "idCol", "Name of the input column for vertex IDs.",
    +    (value: String) => value.nonEmpty)
    +
    +  setDefault(idCol, "id")
    +
    +  /** @group getParam */
    +  @Since("2.4.0")
    +  def getIdCol: String = getOrDefault(idCol)
    +
    +  /**
    +   * Param for the name of the input column for neighbors in the adjacency list representation.
    +   * Default: "neighbors"
    +   * @group param
    +   */
    +  @Since("2.4.0")
    +  val neighborsCol = new Param[String](this, "neighborsCol",
    +    "Name of the input column for neighbors in the adjacency list representation.",
    +    (value: String) => value.nonEmpty)
    +
    +  setDefault(neighborsCol, "neighbors")
    +
    +  /** @group getParam */
    +  @Since("2.4.0")
    +  def getNeighborsCol: String = $(neighborsCol)
    +
    +  /**
    +   * Param for the name of the input column for neighbors in the adjacency list representation.
    +   * Default: "similarities"
    +   * @group param
    +   */
    +  @Since("2.4.0")
    +  val similaritiesCol = new Param[String](this, "similaritiesCol",
    +    "Name of the input column for neighbors in the adjacency list representation.",
    +    (value: String) => value.nonEmpty)
    +
    +  setDefault(similaritiesCol, "sim
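The params quoted above each carry a validator: `k` must be greater than 1, `initMode` must be one of two strings, and the column names must be non-empty. The sketch below shows that validation logic in plain Scala without Spark. The helpers `gt`, `inArray` and `nonEmpty` are local stand-ins written for illustration; they only mirror the role of the validators named in the diff and are not Spark's implementation.

```scala
object PicParamValidationSketch {
  // Local stand-ins for illustration only; not Spark's ParamValidators.
  def gt(lowerBound: Double): Int => Boolean = value => value > lowerBound
  def inArray[T](allowed: Array[T]): T => Boolean = value => allowed.contains(value)
  val nonEmpty: String => Boolean = value => value.nonEmpty

  // The three kinds of checks that appear in the quoted params.
  val validK: Int => Boolean = gt(1)
  val validInitMode: String => Boolean = inArray(Array("random", "degree"))
  val validColumnName: String => Boolean = nonEmpty

  def main(args: Array[String]): Unit = {
    assert(validK(2) && !validK(1))
    assert(validInitMode("degree") && !validInitMode("no_such_a_mode"))
    assert(validColumnName("id") && !validColumnName(""))
    println("all validator checks passed")
  }
}
```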
[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/21090 Took a quick look. Apart from the style failure and a minor formatting issue, LGTM.
[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/21090#discussion_r182243819

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala ---
    @@ -0,0 +1,239 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.clustering
    +
    +import scala.collection.mutable
    +
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types._
    +import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
    +import org.apache.spark.{SparkException, SparkFunSuite}
    +
    +
    +class PowerIterationClusteringSuite extends SparkFunSuite
    +  with MLlibTestSparkContext with DefaultReadWriteTest {
    +
    +  @transient var data: Dataset[_] = _
    +  final val r1 = 1.0
    +  final val n1 = 10
    +  final val r2 = 4.0
    +  final val n2 = 40
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +
    +    data = PowerIterationClusteringSuite.generatePICData(spark, r1, r2, n1, n2)
    +  }
    +
    +  test("default parameters") {
    +    val pic = new PowerIterationClustering()
    +
    +    assert(pic.getK === 2)
    +    assert(pic.getMaxIter === 20)
    +    assert(pic.getInitMode === "random")
    +    assert(pic.getPredictionCol === "prediction")
    +    assert(pic.getIdCol === "id")
    +    assert(pic.getNeighborsCol === "neighbors")
    +    assert(pic.getSimilaritiesCol === "similarities")
    +  }
    +
    +  test("parameter validation") {
    +    intercept[IllegalArgumentException] {
    +      new PowerIterationClustering().setK(1)
    +    }
    +    intercept[IllegalArgumentException] {
    +      new PowerIterationClustering().setInitMode("no_such_a_mode")
    +    }
    +    intercept[IllegalArgumentException] {
    +      new PowerIterationClustering().setIdCol("")
    +    }
    +    intercept[IllegalArgumentException] {
    +      new PowerIterationClustering().setNeighborsCol("")
    +    }
    +    intercept[IllegalArgumentException] {
    +      new PowerIterationClustering().setSimilaritiesCol("")
    +    }
    +  }
    +
    +  test("power iteration clustering") {
    +    val n = n1 + n2
    +
    +    val model = new PowerIterationClustering()
    +      .setK(2)
    +      .setMaxIter(40)
    +    val result = model.transform(data)
    +
    +    val predictions = Array.fill(2)(mutable.Set.empty[Long])
    +    result.select("id", "prediction").collect().foreach {
    +      case Row(id: Long, cluster: Integer) => predictions(cluster) += id
    +    }
    +    assert(predictions.toSet == Set((1 until n1).toSet, (n1 until n).toSet))
    +
    +    val result2 = new PowerIterationClustering()
    +      .setK(2)
    +      .setMaxIter(10)
    +      .setInitMode("degree")
    +      .transform(data)
    +    val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
    +    result2.select("id", "prediction").collect().foreach {
    +      case Row(id: Long, cluster: Integer) => predictions2(cluster) += id
    +    }
    +    assert(predictions2.toSet == Set((1 until n1).toSet, (n1 until n).toSet))
    +  }
    +
    +  test("supported input types") {
    +    val model = new PowerIterationClustering()
    +      .setK(2)
    +      .setMaxIter(1)
    +
    +    def runTest(idType: DataType, neighborType: DataType, similarityType: DataType): Unit = {
    +      val typedData = data.select(
    +        col("id").cast(idType).alias("id"),
    +        col("neighbors").cast(ArrayType(neighborType, containsNull = false)).alias("neighbors"),
    +        col("similarities").cast(ArrayType(similarityType, containsNull = false))
    +          .alias("similarities")
    +      )
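The "power iteration clustering" test quoted above collects ids per predicted cluster and compares a set of sets, which makes the check independent of which cluster happens to receive label 0 or 1. The plain-Scala sketch below isolates that idea. The helper name `groupByCluster` and the sample data are assumptions made for illustration and do not come from the PR.

```scala
import scala.collection.mutable

object ClusterAssignmentCheckSketch {
  // Group ids by predicted cluster, following the Array.fill(k)(mutable.Set...) pattern
  // of the quoted test, then drop the labels by returning a set of id sets.
  def groupByCluster(assignments: Seq[(Long, Int)], k: Int): Set[Set[Long]] = {
    val clusters = Array.fill(k)(mutable.Set.empty[Long])
    assignments.foreach { case (id, cluster) => clusters(cluster) += id }
    clusters.map(_.toSet).toSet
  }

  def main(args: Array[String]): Unit = {
    val expected = Set(Set(0L, 1L, 2L), Set(3L, 4L))
    // Same partition of the ids, with the two cluster labels swapped between runs.
    val run1 = Seq(0L -> 0, 1L -> 0, 2L -> 0, 3L -> 1, 4L -> 1)
    val run2 = Seq(0L -> 1, 1L -> 1, 2L -> 1, 3L -> 0, 4L -> 0)
    assert(groupByCluster(run1, 2) == expected)
    assert(groupByCluster(run2, 2) == expected)
    println("both labelings describe the same clustering")
  }
}
```

Comparing sets of sets matters here because a clustering algorithm gives no guarantee about which group gets which integer label.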
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user wangmiao1981 closed the pull request at: https://github.com/apache/spark/pull/15770
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @jkbradley I am closing this one now. Thanks!
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @jkbradley Sorry for missing your comments. Anyway, I will close it now. I will choose another one to work on. Thanks!
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 ping @yanboliang
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @weichenXu123 Any other comments? Thanks!
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @WeichenXu123 Thanks for your review and reply! I agree with you that the helper can be discussed later as a potential enhancement.
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @WeichenXu123, regarding the graph helper: MLlib has a version that takes a `Graph[Double, Double]` as a parameter for training. In ML, do we have to provide a `Dataset` of `Graph`? Can you specify the requirement? I have addressed your other comments. Thanks!
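For context on the input-format question above: one tabular alternative to a graph object is an adjacency-list row per vertex, holding an id, its neighbors and the matching similarities (the column names that appear in the later PR #21090 diff). Such rows flatten into `(src, dst, similarity)` triples, which is the shape accepted by the RDD-based `run` method of MLlib's `PowerIterationClustering`. The sketch below uses plain Scala collections instead of Spark; the `VertexRow` type and `toTriples` helper are hypothetical names chosen for illustration.

```scala
object AdjacencyListSketch {
  // Hypothetical row type: one vertex, its neighbors, and the matching similarity weights.
  final case class VertexRow(id: Long, neighbors: Seq[Long], similarities: Seq[Double])

  // Flatten each adjacency-list row into (src, dst, similarity) triples.
  def toTriples(rows: Seq[VertexRow]): Seq[(Long, Long, Double)] =
    rows.flatMap { row =>
      require(row.neighbors.length == row.similarities.length,
        s"vertex ${row.id}: neighbors and similarities differ in length")
      row.neighbors.zip(row.similarities).map { case (dst, sim) => (row.id, dst, sim) }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      VertexRow(0L, Seq(1L, 2L), Seq(0.9, 0.1)),
      VertexRow(1L, Seq(2L), Seq(0.5)))
    assert(toTriples(rows) == Seq((0L, 1L, 0.9), (0L, 2L, 0.1), (1L, 2L, 0.5)))
    toTriples(rows).foreach(println)
  }
}
```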
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r143078744

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
    @@ -0,0 +1,216 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.clustering
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.linalg.Vector
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering}
    +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}
    +
    +/**
    + * Common params for PowerIterationClustering
    + */
    +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter
    +  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
    +
    +  /**
    +   * The number of clusters to create (k). Must be > 1. Default: 2.
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val k = new IntParam(this, "k", "The number of clusters to create. " +
    +    "Must be > 1.", ParamValidators.gt(1))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getK: Int = $(k)
    +
    +  /**
    +   * Param for the initialization algorithm. This can be either "random" to use a random vector
    +   * as vertex properties, or "degree" to use normalized sum similarities. Default: random.
    +   */
    +  @Since("2.3.0")
    +  final val initMode = {
    +    val allowedParams = ParamValidators.inArray(Array("random", "degree"))
    +    new Param[String](this, "initMode", "The initialization algorithm. " +
    +      "Supported options: 'random' and 'degree'.", allowedParams)
    +  }
    +
    +  /** @group expertGetParam */
    +  @Since("2.3.0")
    +  def getInitMode: String = $(initMode)
    +
    +  /**
    +   * Param for the column name for ids returned by PowerIterationClustering.transform().
    +   * Default: "id"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val idCol = new Param[String](this, "id", "column name for ids.")
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getIdCol: String = $(idCol)
    +
    +  /**
    +   * Param for the column name for neighbors required by PowerIterationClustering.transform().
    +   * Default: "neighbor"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val neighborCol = new Param[String](this, "neighbor", "column name for neighbors.")
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNeighborCol: String = $(neighborCol)
    +
    +  /**
    +   * Validates the input schema
    +   * @param schema input schema
    +   */
    +  protected def validateSchema(schema: StructType): Unit = {
    +    SchemaUtils.checkColumnType(schema, $(idCol), LongType)
    +    SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
    + * <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
    + * PIC finds a very low-dimensional embedding of a dataset using truncated power
    + * iteration on a normalized pair-wise similarity matrix of the data.
    + *
    + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r143078479

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
    @@ -0,0 +1,216 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.clustering
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.linalg.Vector
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering}
    +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}
    +
    +/**
    + * Common params for PowerIterationClustering
    + */
    +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter
    +  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
    +
    +  /**
    +   * The number of clusters to create (k). Must be > 1. Default: 2.
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  final val k = new IntParam(this, "k", "The number of clusters to create. " +
    +    "Must be > 1.", ParamValidators.gt(1))
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getK: Int = $(k)
    +
    +  /**
    +   * Param for the initialization algorithm. This can be either "random" to use a random vector
    +   * as vertex properties, or "degree" to use normalized sum similarities. Default: random.
    +   */
    +  @Since("2.3.0")
    +  final val initMode = {
    +    val allowedParams = ParamValidators.inArray(Array("random", "degree"))
    +    new Param[String](this, "initMode", "The initialization algorithm. " +
    +      "Supported options: 'random' and 'degree'.", allowedParams)
    +  }
    +
    +  /** @group expertGetParam */
    +  @Since("2.3.0")
    +  def getInitMode: String = $(initMode)
    +
    +  /**
    +   * Param for the column name for ids returned by PowerIterationClustering.transform().
    +   * Default: "id"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val idCol = new Param[String](this, "id", "column name for ids.")
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getIdCol: String = $(idCol)
    +
    +  /**
    +   * Param for the column name for neighbors required by PowerIterationClustering.transform().
    +   * Default: "neighbor"
    +   * @group param
    +   */
    +  @Since("2.3.0")
    +  val neighborCol = new Param[String](this, "neighbor", "column name for neighbors.")
    +
    +  /** @group getParam */
    +  @Since("2.3.0")
    +  def getNeighborCol: String = $(neighborCol)
    +
    +  /**
    +   * Validates the input schema
    +   * @param schema input schema
    +   */
    +  protected def validateSchema(schema: StructType): Unit = {
    +    SchemaUtils.checkColumnType(schema, $(idCol), LongType)
    +    SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
    + * <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
    + * PIC finds a very low-dimensional embedding of a dataset using truncated power
    + * iteration on a normalized pair-wise similarity matrix of the data.
    + *
    + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 I will address the review comments soon. Thanks! @WeichenXu123
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 ping @WeichenXu123
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @WeichenXu123 I have made changes based on your comments. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 Jenkins, retest this please.
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770

    [info] Main Scala API documentation successful.
    [error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
    [error] Total time: 95 s, completed Aug 15, 2017 4:59:59 PM
    [error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc ; received return code 1

This failure seems unrelated.
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 retest please
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 Jenkins, retest please
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 Weird, the local style test passed. Anyway, I changed the order as required by Jenkins.
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r133271527

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
    @@ -0,0 +1,213 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.clustering
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.linalg.Vector
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering}
    +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}
    +
    +/**
    + * Common params for PowerIterationClustering
    + */
    +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter
    +  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
    +
    +  /**
    +   * The number of clusters to create (k). Must be > 1. Default: 2.
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val k = new IntParam(this, "k", "The number of clusters to create. " +
    +    "Must be > 1.", ParamValidators.gt(1))
    +
    +  /** @group getParam */
    +  @Since("2.2.0")
    +  def getK: Int = $(k)
    +
    +  /**
    +   * Param for the initialization algorithm. This can be either "random" to use a random vector
    +   * as vertex properties, or "degree" to use normalized sum similarities. Default: random.
    +   */
    +  @Since("2.2.0")
    +  final val initMode = {
    +    val allowedParams = ParamValidators.inArray(Array("random", "degree"))
    +    new Param[String](this, "initMode", "The initialization algorithm. " +
    +      "Supported options: 'random' and 'degree'.", allowedParams)
    +  }
    +
    +  /** @group expertGetParam */
    +  @Since("2.2.0")
    +  def getInitMode: String = $(initMode)
    +
    +  /**
    +   * Param for the column name for ids returned by [[PowerIterationClustering.transform()]].
    +   * Default: "id"
    +   * @group param
    +   */
    +  val idCol = new Param[String](this, "id", "column name for ids.")
    +
    +  /** @group getParam */
    +  def getIdCol: String = $(idCol)
    +
    +  /**
    +   * Param for the column name for neighbors required by [[PowerIterationClustering.transform()]].
    +   * Default: "neighbor"
    +   * @group param
    +   */
    +  val neighborCol = new Param[String](this, "neighbor", "column name for neighbors.")
    +
    +  /** @group getParam */
    +  def getNeighborCol: String = $(neighborCol)
    +
    +  /**
    +   * Validates the input schema
    +   * @param schema input schema
    +   */
    +  protected def validateSchema(schema: StructType): Unit = {
    +    SchemaUtils.checkColumnType(schema, $(idCol), LongType)
    +    SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
    + * <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
    + * PIC finds a very low-dimensional embedding of a dataset using truncated power
    + * iteration on a normalized pair-wise similarity matrix of the data.
    + *
    + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is an
    + * expensive operation, because it uses PIC algorithm to cluster the whole input dataset.
    + *
    + * @see http:/
[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/15770#discussion_r133267575

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
    @@ -0,0 +1,213 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.clustering
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.linalg.Vector
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering}
    +import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.{DataFrame, Dataset, Row}
    +import org.apache.spark.sql.functions.col
    +import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}
    +
    +/**
    + * Common params for PowerIterationClustering
    + */
    +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter
    +  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
    +
    +  /**
    +   * The number of clusters to create (k). Must be > 1. Default: 2.
    +   * @group param
    +   */
    +  @Since("2.2.0")
    +  final val k = new IntParam(this, "k", "The number of clusters to create. " +
    +    "Must be > 1.", ParamValidators.gt(1))
    +
    +  /** @group getParam */
    +  @Since("2.2.0")
    +  def getK: Int = $(k)
    +
    +  /**
    +   * Param for the initialization algorithm. This can be either "random" to use a random vector
    +   * as vertex properties, or "degree" to use normalized sum similarities. Default: random.
    +   */
    +  @Since("2.2.0")
    +  final val initMode = {
    +    val allowedParams = ParamValidators.inArray(Array("random", "degree"))
    +    new Param[String](this, "initMode", "The initialization algorithm. " +
    +      "Supported options: 'random' and 'degree'.", allowedParams)
    +  }
    +
    +  /** @group expertGetParam */
    +  @Since("2.2.0")
    +  def getInitMode: String = $(initMode)
    +
    +  /**
    +   * Param for the column name for ids returned by [[PowerIterationClustering.transform()]].
    +   * Default: "id"
    +   * @group param
    +   */
    +  val idCol = new Param[String](this, "id", "column name for ids.")
    +
    +  /** @group getParam */
    +  def getIdCol: String = $(idCol)
    +
    +  /**
    +   * Param for the column name for neighbors required by [[PowerIterationClustering.transform()]].
    +   * Default: "neighbor"
    +   * @group param
    +   */
    +  val neighborCol = new Param[String](this, "neighbor", "column name for neighbors.")
    +
    +  /** @group getParam */
    +  def getNeighborCol: String = $(neighborCol)
    +
    +  /**
    +   * Validates the input schema
    +   * @param schema input schema
    +   */
    +  protected def validateSchema(schema: StructType): Unit = {
    +    SchemaUtils.checkColumnType(schema, $(idCol), LongType)
    +    SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
    + * <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
    + * PIC finds a very low-dimensional embedding of a dataset using truncated power
    + * iteration on a normalized pair-wise similarity matrix of the data.
    + *
    + * Note that we implement [[PowerIterationClustering]] as a transformer. The [[transform]] is an
    + * expensive operation, because it uses PIC algorithm to cluster the whole input dataset.
    + *
    + * @see http:/
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @WeichenXu123 Thanks for reviewing! I will address the comments soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 @felixcheung Can you take a look? Thanks!
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 @yanboliang I have made changes accordingly. Thanks!
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 @yanboliang Thanks for your reply! I will change the unit tests now.
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 @yanboliang After #18613, the unit tests fail if "skip" is used. For example:

    data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
                       someString = base::sample(c("this", "that"), 10, replace = TRUE),
                       stringsAsFactors = FALSE)
    trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
    traindf <- as.DataFrame(data[trainidxs, ])
    testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
    model <- spark.mlp(traindf, clicked ~ ., layers = c(1, 3), handleInvalid = "keep")
    predictions <- predict(model, testdf)
    expect_equal(class(collect(predictions)$clicked[1]), "character")

It fails as if "error" were used. If I change "skip" to "keep", then `collect(predictions)$click[1]` is NULL:

    > collect(predictions)
    [1] clicked    someString prediction
    <0 rows> (or 0-length row.names)
    > collect(predictions)$click[1]
    [[1]]
    NULL

I am not sure whether this is expected or whether there is a bug. Previously, the unit tests worked fine.
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 Sure. I am reading the #18613 comments. I just came back from a business trip. Thanks!
[GitHub] spark issue #18613: [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should han...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18613 @felixcheung I agree. We should make the changes on the Scala side.
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 Trigger windows check.
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 Reopen for windows check
[GitHub] spark pull request #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleIn...
GitHub user wangmiao1981 reopened a pull request: https://github.com/apache/spark/pull/18605 [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid for classification algorithms ## What changes were proposed in this pull request? SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR. This is a followup PR for SPARK-20307. ## How was this patch tested? New Unit tests are added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark class Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18605.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18605
[GitHub] spark pull request #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleIn...
Github user wangmiao1981 closed the pull request at: https://github.com/apache/spark/pull/18605
[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18605 @felixcheung This is a follow-up PR for SPARK-20307.
[GitHub] spark pull request #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleIn...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/18605 [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid for classification algorithms ## What changes were proposed in this pull request? SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR. This is a followup PR for SPARK-20307. ## How was this patch tested? New Unit tests are added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark class Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18605.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18605 commit 77b04a37e93d6967def24c0a8265ed784875f5b0 Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-07-12T00:40:58Z add handleInvalid for classifications
[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18496 #14850 is the PR that prints the full stack. We can improve it by printing the cause instead of the whole stack.
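As a rough illustration of printing the cause rather than the whole stack, the sketch below walks the `getCause` chain of a nested exception and returns the innermost one. The object name `RootCauseSketch` and the sample messages are assumptions made up for this example; this is not the code from #14850 or from SparkR's error handling.

```scala
import scala.annotation.tailrec

// Hypothetical helper, not part of Spark: find the innermost cause of an exception.
object RootCauseSketch {
  @tailrec
  def rootCause(t: Throwable): Throwable = {
    val cause = t.getCause
    // getCause returns null when there is no further cause.
    if (cause == null) t else rootCause(cause)
  }

  def main(args: Array[String]): Unit = {
    // Made-up nesting that loosely resembles the errors quoted in this thread.
    val nested = new RuntimeException("Job aborted due to stage failure",
      new RuntimeException("Failed to execute user defined function",
        new IllegalArgumentException("Unseen label: the other")))
    // Print only the innermost message instead of the full stack trace.
    println(rootCause(nested).getMessage)
  }
}
```

Reporting only the innermost message keeps the R console output short, at the cost of hiding the intermediate frames, so a real change would probably still want the full trace in the logs.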
[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18496 I will review all classifiers and add `handleInvalid` where necessary.
[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18496 Actually, the UDF in transform() of StringIndexer.scala throws an exception when an action runs, but it doesn't stop the execution of collect().

    val indexer = udf { label: String =>
      if (label == null) {
        if (keepInvalid) {
          labels.length
        } else {
          throw new SparkException("StringIndexer encountered NULL value. To handle or skip " +
            "NULLS, try setting StringIndexer.handleInvalid.")
        }
      } else {
        if (labelToIndex.contains(label)) {
          labelToIndex(label)
        } else if (keepInvalid) {
          labels.length
        } else {
          // <=== this is the exception
          throw new SparkException(s"Unseen label: $label. To handle unseen labels, " +
            s"set Param handleInvalid to ${StringIndexer.KEEP_INVALID}.")
        }
      }
    }

I am asking other people who are familiar with this logic to understand why it doesn't stop the collect().
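The three `handleInvalid` modes discussed in this thread ("skip", "error", "keep") can be sketched in plain Scala without a Spark dependency. Everything below is an assumption made for illustration: `HandleInvalidSketch`, `fit`, and `transform` are hypothetical names, labels are indexed in order of first appearance rather than by whatever ordering Spark uses, and "skip" is modeled simply as filtering out invalid rows.

```scala
// Plain-Scala sketch of the three handleInvalid modes; not Spark's implementation.
object HandleInvalidSketch {
  // Assign an index to every non-null label, in order of first appearance (simplification).
  def fit(labels: Seq[String]): Map[String, Double] =
    labels.filter(_ != null).distinct.zipWithIndex.map { case (l, i) => l -> i.toDouble }.toMap

  def transform(
      labelToIndex: Map[String, Double],
      rows: Seq[String],
      handleInvalid: String): Seq[(String, Double)] = handleInvalid match {
    case "skip" =>
      // Drop rows whose label is null or was not seen during fit.
      rows.filter(r => r != null && labelToIndex.contains(r)).map(r => r -> labelToIndex(r))
    case "keep" =>
      // Put every invalid label into one extra bucket after the known labels.
      val extra = labelToIndex.size.toDouble
      rows.map(r => r -> (if (r == null) extra else labelToIndex.getOrElse(r, extra)))
    case "error" =>
      // Fail on the first invalid label.
      rows.map { r =>
        if (r == null) throw new IllegalArgumentException("Encountered NULL value")
        r -> labelToIndex.getOrElse(r, throw new IllegalArgumentException(s"Unseen label: $r"))
      }
    case other =>
      throw new IllegalArgumentException(s"Unsupported handleInvalid value: $other")
  }

  def main(args: Array[String]): Unit = {
    val model = fit(Seq("a", "b", "b"))
    println(transform(model, Seq("a", "d"), "skip"))
    println(transform(model, Seq("a", "d"), "keep"))
  }
}
```

In this sketch the "error" branch fails immediately because `Seq.map` is strict; in Spark the equivalent function only runs once an action such as collect() is executed.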
[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18496 I did a quick debug. In Dataset.scala:

    def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
      val qe = sparkSession.sessionState.executePlan(logicalPlan)  // <=== this line throws

This line throws: Method threw 'org.apache.spark.SparkException' exception. Cannot evaluate org.apache.spark.sql.execution.QueryExecution.toString()
[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18496 @felixcheung Yes. I think we can improve the Scala side. It only throws an exception when a `NULL` field is given. For unseen labels, as in the example above, it always fails at the same place (`double` to `string`). The Scala side does not catch this exception and lets it reach the handling logic, which causes the failure. I will try to address it in a follow-up PR.
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15770 @yanboliang Can you take a look first? Thanks!
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18496#discussion_r126198035 --- Diff: R/pkg/tests/fulltests/test_mllib_tree.R --- @@ -212,6 +212,23 @@ test_that("spark.randomForest", { expect_equal(length(grep("1.0", predictions)), 50) expect_equal(length(grep("2.0", predictions)), 50) + # Test unseen labels + data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE), +someString = base::sample(c("this", "that"), 10, replace = TRUE), +stringsAsFactors = FALSE) + trainidxs <- base::sample(nrow(data), nrow(data) * 0.7) + traindf <- as.DataFrame(data[trainidxs, ]) + testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other"))) + model <- spark.randomForest(traindf, clicked ~ ., type = "classification", + maxDepth = 10, maxBins = 10, numTrees = 10) + predictions <- predict(model, testdf) + expect_error(collect(predictions)) --- End diff -- On the Scala side, I created a case where an unseen label is used in the test data:

    val data: Seq[(Int, String)] = Seq((0, "a"), (1, "b"), (2, "b"), (3, null))
    val data2: Seq[(Int, String)] = Seq((0, "a"), (1, "b"), (3, "d"))
    val df = data.toDF("id", "label")
    val df2 = data2.toDF("id", "label")
    val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
    indexer.setHandleInvalid("error")
    indexer.fit(df).transform(df2).collect()

It also fails with the same error message as the R case. I think this is the expected behavior for `"error"`.

Failed Messages: Failed to execute user defined function($anonfun$9: (string) => double)

    org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
      at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
      at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:139)
      at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
      at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18496#discussion_r125954907 --- Diff: R/pkg/tests/fulltests/test_mllib_tree.R --- @@ -212,6 +212,23 @@ test_that("spark.randomForest", { expect_equal(length(grep("1.0", predictions)), 50) expect_equal(length(grep("2.0", predictions)), 50) + # Test unseen labels + data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE), +someString = base::sample(c("this", "that"), 10, replace = TRUE), +stringsAsFactors = FALSE) + trainidxs <- base::sample(nrow(data), nrow(data) * 0.7) + traindf <- as.DataFrame(data[trainidxs, ]) + testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other"))) + model <- spark.randomForest(traindf, clicked ~ ., type = "classification", + maxDepth = 10, maxBins = 10, numTrees = 10) + predictions <- predict(model, testdf) + expect_error(collect(predictions)) --- End diff -- Let me check how the "error" option is handled. It seems that no exception is thrown.
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18496#discussion_r125802201 --- Diff: R/pkg/tests/fulltests/test_mllib_tree.R --- @@ -212,6 +212,23 @@ test_that("spark.randomForest", { expect_equal(length(grep("1.0", predictions)), 50) expect_equal(length(grep("2.0", predictions)), 50) + # Test unseen labels + data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE), +someString = base::sample(c("this", "that"), 10, replace = TRUE), +stringsAsFactors = FALSE) + trainidxs <- base::sample(nrow(data), nrow(data) * 0.7) + traindf <- as.DataFrame(data[trainidxs, ]) + testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other"))) + model <- spark.randomForest(traindf, clicked ~ ., type = "classification", + maxDepth = 10, maxBins = 10, numTrees = 10) + predictions <- predict(model, testdf) + expect_error(collect(predictions)) --- End diff -- The console prints out: Error in handleErrors(returnStatus, conn) : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double) Shall I match this?
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18496#discussion_r125703030 --- Diff: R/pkg/tests/fulltests/test_mllib_tree.R --- @@ -212,6 +212,23 @@ test_that("spark.randomForest", { expect_equal(length(grep("1.0", predictions)), 50) expect_equal(length(grep("2.0", predictions)), 50) + # Test unseen labels + data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE), +someString = base::sample(c("this", "that"), 10, replace = TRUE), +stringsAsFactors = FALSE) + trainidxs <- base::sample(nrow(data), nrow(data) * 0.7) + traindf <- as.DataFrame(data[trainidxs, ]) + testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other"))) + model <- spark.randomForest(traindf, clicked ~ ., type = "classification", + maxDepth = 10, maxBins = 10, numTrees = 10) + predictions <- predict(model, testdf) + expect_error(collect(predictions)) --- End diff -- The training call raises no error because the training data has no unseen label. I think the internals have logic for handling unseen labels, but when collecting (an action), they can't map the internal value to the unseen label. That is why it only fails on collect. I will add the error string.
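One way to picture why the failure only surfaces at collect() is lazy evaluation: the indexing function is attached when the prediction is defined but only runs once rows are materialized. The sketch below mimics this with a plain Scala `Iterator` standing in for a Spark DataFrame. `LazyFailureSketch`, `index`, and the label map are made up for illustration and do not reflect Spark internals.

```scala
import scala.util.Try

// Illustration only: a lazy Iterator stands in for a Spark DataFrame.
object LazyFailureSketch {
  // Hypothetical labels seen during training.
  private val labelToIndex = Map("this" -> 0.0, "that" -> 1.0)

  // Builds a lazy pipeline; the lookup runs only when elements are consumed.
  def index(rows: Seq[String]): Iterator[Double] =
    rows.iterator.map { s =>
      labelToIndex.getOrElse(s, throw new IllegalArgumentException(s"Unseen label: $s"))
    }

  def main(args: Array[String]): Unit = {
    // Defining the pipeline does not evaluate anything, so nothing is thrown here.
    val predictions = index(Seq("this", "the other"))
    // Forcing the iterator plays the role of collect(); the unseen label fails here.
    val collected = Try(predictions.toList)
    println(collected.isFailure)
  }
}
```

Under this reading, `expect_error(collect(predictions))` tests the right call: predict() alone only defines the computation.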
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18496#discussion_r125702340 --- Diff: R/pkg/R/mllib_tree.R --- @@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching #' can speed up training of deeper trees. Users can set how often should the #' cache be checkpointed or disable it by setting checkpointInterval. +#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model. +#'Supported options: "skip" (filter out rows with invalid data), +#' "error" (throw an error), "keep" (put invalid data in a special additional --- End diff -- Yes. `error` is the default behavior. The backend code has setDefault. I will reorder the options and add the text to the documentation.
[GitHub] spark pull request #18518: [MINOR][SparkR]: ignore Rplots.pdf test output af...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/18518 [MINOR][SparkR]: ignore Rplots.pdf test output after running R tests ## What changes were proposed in this pull request? After running R tests in local build, it outputs Rplots.pdf. This one should be ignored in the git repository. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark ignore Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18518.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18518 commit 3abca0488da7496cad6038321aac24d1a910670e Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-07-03T21:57:16Z ignore one test output file
[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18496 I will fix it tonight. It is weird. In my local test, it passed. It seems that my new change doesn't apply to the test. Anyway, I will fix the failure first.
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18496#discussion_r125154756 --- Diff: R/pkg/R/mllib_tree.R --- @@ -409,7 +413,7 @@ setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "fo maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL, featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0, minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10, - maxMemoryInMB = 256, cacheNodeIds = FALSE) { + maxMemoryInMB = 256, cacheNodeIds = FALSE, handleInvalid = "error") { --- End diff -- Let me check how to use match.arg().
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18496#discussion_r125154735 --- Diff: R/pkg/R/mllib_tree.R --- @@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching #' can speed up training of deeper trees. Users can set how often should the #' cache be checkpointed or disable it by setting checkpointInterval. +#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model. --- End diff -- I think `labels` here means the string labels of a categorical feature (e.g., `white`, `black`).
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 @jiangxb1987 The original PR has some issues that are not correctly handled. I will open a new PR when I figure out the right fix. I intended to close this PR. Thanks for closing it.
[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/18496 [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer ## What changes were proposed in this pull request? For the randomForest classifier, if test data contains unseen labels, it will throw an error. The StringIndexer already has the handleInvalid logic. The patch adds a new method to set the underlying StringIndexer handleInvalid logic. This patch should also apply to other classifiers. This PR focuses on the main logic and the randomForest classifier. I will do a follow-up PR for other classifiers. ## How was this patch tested? Add a new unit test based on the error case in the JIRA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark handle Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18496 commit a2cdf511f6ad346efcb81d51f3b805a34063fa0f Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-07-01T04:00:27Z handle unseen labels
[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18128 ping @yanboliang
[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18128 @felixcheung If I remove `as.integer`, the backend doesn't recognize the value as an `integer`.
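A small base-R illustration of why the explicit conversion can matter, under the assumption that the backend picks the wire type from the R type of the value. A bare numeric literal in R is a double, so without `as.integer` (or an `L` suffix) the value would not be seen as an integer.

```r
x <- 2              # bare numeric literal: stored as a double
y <- as.integer(2)  # explicit conversion to integer
z <- 2L             # integer literal, same result as as.integer(2)

type_x <- typeof(x)
type_y <- typeof(y)
same_integer <- identical(y, z)
```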
[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18128 Local test passed. Let me check it tonight.
[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18128 Jenkins retest this please
[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18128#discussion_r119978881 --- Diff: R/pkg/R/mllib_classification.R --- @@ -239,21 +253,64 @@ function(object, path, overwrite = FALSE) { setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula"), function(data, formula, regParam = 0.0, elasticNetParam = 0.0, maxIter = 100, tol = 1E-6, family = "auto", standardization = TRUE, - thresholds = 0.5, weightCol = NULL, aggregationDepth = 2) { + thresholds = 0.5, weightCol = NULL, aggregationDepth = 2, + lowerBoundsOnCoefficients = NULL, upperBoundsOnCoefficients = NULL, + lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts = NULL) { formula <- paste(deparse(formula), collapse = "") +row <- 0 +col <- 0 if (!is.null(weightCol) && weightCol == "") { weightCol <- NULL } else if (!is.null(weightCol)) { weightCol <- as.character(weightCol) } +if (!is.null(lowerBoundsOnIntercepts)) { +lowerBoundsOnIntercepts <- as.array(lowerBoundsOnIntercepts) +} + +if (!is.null(upperBoundsOnIntercepts)) { +upperBoundsOnIntercepts <- as.array(upperBoundsOnIntercepts) +} + +if (!is.null(lowerBoundsOnCoefficients)) { + if (class(lowerBoundsOnCoefficients) != "matrix") { +stop("lowerBoundsOnCoefficients must be a matrix.") + } + row <- nrow(lowerBoundsOnCoefficients) + col <- ncol(lowerBoundsOnCoefficients) + lowerBoundsOnCoefficients <- as.array(as.vector(lowerBoundsOnCoefficients)) +} + +if (!is.null(upperBoundsOnCoefficients)) { + if (class(upperBoundsOnCoefficients) != "matrix") { +stop("upperBoundsOnCoefficients must be a matrix.") + } + + if (!is.null(lowerBoundsOnCoefficients) & (row != nrow(upperBoundsOnCoefficients) +| col != ncol(upperBoundsOnCoefficients))) { +stop(paste("dimension of upperBoundsOnCoefficients ", + "is not the same as lowerBoundsOnCoefficients", sep = "")) + } + + if (is.null(lowerBoundsOnCoefficients)) { +row <- nrow(upperBoundsOnCoefficients) +col <- ncol(upperBoundsOnCoefficients) + } 
--- End diff -- This is the case where we only set the upper bound. We can set both bounds or either one of them. When both are set, we enforce that the upper and lower bounds have the same dimensions, as checked above.
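The rule described in this comment can be sketched in base R as follows. The helper `check_bounds` is hypothetical and is not the code in the diff. It uses `is.matrix()` and the scalar operators `&&`/`||` instead of `class(x) != "matrix"` with `&`/`|`, since from R 4.0 `class()` of a matrix returns more than one element.

```r
# Hypothetical helper, not part of SparkR: validates optional bound matrices.
check_bounds <- function(lower = NULL, upper = NULL) {
  for (b in list(lower, upper)) {
    if (!is.null(b) && !is.matrix(b)) {
      stop("bounds on coefficients must be a matrix.")
    }
  }
  # Dimensions can only be compared when both bounds are supplied.
  if (!is.null(lower) && !is.null(upper) &&
      !identical(dim(lower), dim(upper))) {
    stop("dimension of upperBoundsOnCoefficients ",
         "is not the same as lowerBoundsOnCoefficients")
  }
  # Report the dimensions of whichever bound is present (0 x 0 if neither).
  ref <- if (!is.null(lower)) lower else upper
  if (is.null(ref)) c(row = 0L, col = 0L) else c(row = nrow(ref), col = ncol(ref))
}

no_bounds <- check_bounds()
only_upper <- check_bounds(upper = matrix(1, nrow = 2, ncol = 3))
mismatch <- tryCatch(
  check_bounds(matrix(0, nrow = 2, ncol = 3), matrix(1, nrow = 3, ncol = 2)),
  error = function(e) conditionMessage(e)
)
```

Either bound may be omitted; the dimension check only fires when both are present, which matches the behaviour described in the comment.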
[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18128#discussion_r119911447 --- Diff: R/pkg/R/mllib_classification.R --- @@ -239,21 +253,57 @@ function(object, path, overwrite = FALSE) { setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula"), function(data, formula, regParam = 0.0, elasticNetParam = 0.0, maxIter = 100, tol = 1E-6, family = "auto", standardization = TRUE, - thresholds = 0.5, weightCol = NULL, aggregationDepth = 2) { + thresholds = 0.5, weightCol = NULL, aggregationDepth = 2, + lowerBoundsOnCoefficients = NULL, upperBoundsOnCoefficients = NULL, + lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts = NULL) { formula <- paste(deparse(formula), collapse = "") +lrow <- 0 +lcol <- 0 +urow <- 0 +ucol <- 0 --- End diff -- Oh, I think I can do the check because I have a `NULL` check before enforcing the rule.
[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/18128#discussion_r119911006 --- Diff: R/pkg/R/mllib_classification.R --- @@ -239,21 +253,57 @@ function(object, path, overwrite = FALSE) { setMethod("spark.logit", signature(data = "SparkDataFrame", formula = "formula"), function(data, formula, regParam = 0.0, elasticNetParam = 0.0, maxIter = 100, tol = 1E-6, family = "auto", standardization = TRUE, - thresholds = 0.5, weightCol = NULL, aggregationDepth = 2) { + thresholds = 0.5, weightCol = NULL, aggregationDepth = 2, + lowerBoundsOnCoefficients = NULL, upperBoundsOnCoefficients = NULL, + lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts = NULL) { formula <- paste(deparse(formula), collapse = "") +lrow <- 0 +lcol <- 0 +urow <- 0 +ucol <- 0 --- End diff -- Question: Based on my understanding, `lowerBoundsOnCoefficients` and `upperBoundsOnCoefficients` are not required to be set at the same time, although they can be. When only one is set, we can't compare the dimensions of the two matrices because the other is `NULL`. When both are set, we can check them. So we can't enforce the rule in general.
[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/18128 @yanboliang Can you take a look? Thanks!
[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/18128 [SPARK-20906][SparkR]:Constrained Logistic Regression for SparkR ## What changes were proposed in this pull request? PR https://github.com/apache/spark/pull/17715 added Constrained Logistic Regression for ML. We should add it to SparkR. ## How was this patch tested? Add new unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18128.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18128 commit 1fc68f69ecce46c8d4c2bbd2d9aafdd042c27108 Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-05-27T06:27:04Z add constraint logit commit 7627ac9c093ba72afd586c3ea1e482238d29c3c3 Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-05-27T07:29:25Z add unit test and doc
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116345383 --- Diff: R/pkg/DESCRIPTION --- @@ -42,6 +42,7 @@ Collate: 'functions.R' 'install.R' 'jvm.R' +'mllib_wrapper.R' --- End diff -- Can you put it in lexicographic order?
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116345166 --- Diff: R/pkg/R/mllib_regression.R --- @@ -360,6 +338,7 @@ setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = "formula" # Get the summary of an IsotonicRegressionModel model +#' @param object a fitted IsotonicRegressionModel. --- End diff -- You use capital A below.
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116345323 --- Diff: R/pkg/R/mllib_wrapper.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' S4 class that represents a Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaModel since 2.3.0 +setClass("JavaModel", representation(jobj = "jobj")) + +#' Makes predictions from a Java ML model +#' +#' @param object a Spark ML model. +#' @param newData a SparkDataFrame for testing. +#' @return \code{predict} returns a SparkDataFrame containing predicted value. +#' @rdname spark.predict +#' @aliases predict,JavaModel-method +#' @export +#' @note predict since 2.3.0 +setMethod("predict", signature(object = "JavaModel"), + function(object, newData) { +predict_internal(object, newData) + }) + +#' S4 class that represents a writable Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaMLWritable since 2.3.0 +setClass("JavaMLWritable", representation(jobj = "jobj")) + +# Save the ML model to the output path. + +#' @param object A fitted ML model. 
+#' @param path The directory where the model is saved. +#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE --- End diff -- `O` -> `o` ?
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116345209 --- Diff: R/pkg/R/mllib_wrapper.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' S4 class that represents a Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model --- End diff -- `backing` -> `backend`?
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116345283 --- Diff: R/pkg/R/mllib_wrapper.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' S4 class that represents a Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaModel since 2.3.0 +setClass("JavaModel", representation(jobj = "jobj")) + +#' Makes predictions from a Java ML model +#' +#' @param object a Spark ML model. +#' @param newData a SparkDataFrame for testing. +#' @return \code{predict} returns a SparkDataFrame containing predicted value. +#' @rdname spark.predict +#' @aliases predict,JavaModel-method +#' @export +#' @note predict since 2.3.0 +setMethod("predict", signature(object = "JavaModel"), + function(object, newData) { +predict_internal(object, newData) + }) + +#' S4 class that represents a writable Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaMLWritable since 2.3.0 +setClass("JavaMLWritable", representation(jobj = "jobj")) + +# Save the ML model to the output path. + +#' @param object A fitted ML model. --- End diff -- `A` -> `a` ? 
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116344992 --- Diff: R/pkg/R/mllib_classification.R --- @@ -22,29 +22,36 @@ #' #' @param jobj a Java object reference to the backing Scala LinearSVCModel #' @export +#' @include mllib_wrapper.R #' @note LinearSVCModel since 2.2.0 -setClass("LinearSVCModel", representation(jobj = "jobj")) +setClass("LinearSVCModel", representation(jobj = "jobj"), + contains = c("JavaModel", "JavaMLWritable")) #' S4 class that represents an LogisticRegressionModel #' #' @param jobj a Java object reference to the backing Scala LogisticRegressionModel #' @export #' @note LogisticRegressionModel since 2.1.0 --- End diff -- Missing `#' @include mllib_wrapper.R`?
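A minimal S4 sketch of the inheritance pattern under review, using only the base `methods` package. The class and generic names ending in `Sketch` are hypothetical, and the `jobj` slot is a plain character string here rather than a real Java object reference. It illustrates why the parent class file has to be collated first, which is what roxygen's `@include` arranges: `setClass(..., contains = ...)` needs the parent class to be defined already.

```r
library(methods)

# Parent class: stands in for JavaModel; jobj is a string placeholder.
setClass("JavaModelSketch", representation(jobj = "character"))

setGeneric("describeSketch",
           function(object) standardGeneric("describeSketch"))

# The method is written once, on the parent class.
setMethod("describeSketch", signature(object = "JavaModelSketch"),
          function(object) paste("model backed by", object@jobj))

# The child class inherits both the slot and the method.
setClass("LinearSVCModelSketch", contains = "JavaModelSketch")

model <- new("LinearSVCModelSketch", jobj = "svc-ref")
description <- describeSketch(model)
```

Each concrete model class then only declares its parent instead of repeating the same method body, which is the boilerplate reduction the PR title refers to.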
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116344933 --- Diff: R/pkg/R/generics.R --- @@ -1535,9 +1535,7 @@ setGeneric("spark.freqItemsets", function(object) { standardGeneric("spark.freqI #' @export setGeneric("spark.associationRules", function(object) { standardGeneric("spark.associationRules") }) -#' @param object a fitted ML model object. --- End diff -- Why remove the three lines?
[GitHub] spark issue #17808: [SPARK-20533][SparkR]:SparkR Wrappers Model should be pr...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17808 I think we don't have to back-port. This is a small improvement/optimization of the original code.
[GitHub] spark pull request #17808: [SPARK-20533][SparkR]:SparkR Wrappers Model shoul...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/17808 [SPARK-20533][SparkR]:SparkR Wrappers Model should be private and value should be lazy ## What changes were proposed in this pull request? MultilayerPerceptronClassifierWrapper model should be private. LogisticRegressionWrapper.scala rFeatures and rCoefficients should be lazy. ## How was this patch tested? Unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark lazy Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17808.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17808 commit c1eaca911bf4aa4315929eda6ea6e7f6ceff04f4 Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-04-29T16:49:14Z change private and lazy
[GitHub] spark issue #17805: [SPARK-20477][SparkR][DOC]: Document R bisecting k-means...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17805 cc @felixcheung This is a similar documentation change.
[GitHub] spark pull request #17805: [SparkR][DOC][SPARK-20477]: Document R bisecting ...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/17805 [SparkR][DOC][SPARK-20477]: Document R bisecting k-means in R programming guide ## What changes were proposed in this pull request? Add a hyperlink in the SparkR programming guide. ## How was this patch tested? Build the doc and manually check the doc link. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17805.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17805 commit 540bf7a34dcb7db0892e3cadf24b0c01364162f2 Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-04-28T17:02:04Z add spark.bisectingKmeans doc in the programming guide
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113974703 --- Diff: R/pkg/R/serialize.R --- @@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) { Date = writeDate(con, object), POSIXlt = writeTime(con, object), POSIXct = writeTime(con, object), + bigint = writeDouble(con, object), --- End diff -- For completeness, I think we can keep the write logic on the R side.
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113972686 --- Diff: R/pkg/R/serialize.R --- @@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) { Date = writeDate(con, object), POSIXlt = writeTime(con, object), POSIXct = writeTime(con, object), + bigint = writeDouble(con, object), --- End diff -- When using createDataFrame, R uses `serialize` to send data to the backend. When taking an action, say `collect`, the Scala-side logic refers to the schema field and calls `readTypedObjects`, where the newly added read logic kicks in. When the result is returned to the R side, the newly added Scala write logic kicks in, and the R side can interpret it thanks to the R-side read logic. It seems that the `write` logic on the R side is never called, because R has no specific `bigint` type. Right?
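A base-R illustration of the dispatch question raised here, assuming (as the diff suggests) that `writeObject` switches on the R class of the value. R has no native 64-bit integer type, so a value beyond the 32-bit range is an ordinary `numeric` double and a `bigint` branch is never selected. The `pick_writer` function is a hypothetical stand-in, not the SparkR implementation.

```r
big <- 2^40                                  # larger than any 32-bit integer
big_class <- class(big)                      # same class as any other double
exceeds_int <- big > .Machine$integer.max

# Hypothetical stand-in for a class-based dispatch like the one in the diff.
pick_writer <- function(object) {
  switch(class(object)[1],
         integer = "writeInt",
         numeric = "writeDouble",
         bigint = "writeDouble",   # unreachable for plain R values
         "other")
}

writer_for_big <- pick_writer(big)
writer_for_int <- pick_writer(5L)

# Doubles represent integers exactly up to 2^53.
exact <- (2^53 - 1) + 1 == 2^53
```

Since both large integers and ordinary doubles arrive as `numeric`, the class alone cannot tell the writer which one the caller meant.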
[GitHub] spark issue #17797: [SparkR][DOC]:Document LinearSVC in R programming guide
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17797 @felixcheung As I checked the SparkR programming guide, it seems that all the machine learning parts are links to existing documents. So I just added the link to the Linear SVM document and tested it. Thanks!
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113853483 --- Diff: R/pkg/R/serialize.R --- @@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) { Date = writeDate(con, object), POSIXlt = writeTime(con, object), POSIXct = writeTime(con, object), + bigint = writeDouble(con, object), --- End diff -- I see. But as you mentioned, we don't know how to trigger the write path on the R side, because both bigint and double values are `numeric` in R. I think we can just remove the test on the R side.
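The point about both values being `numeric` can be illustrated with a small base-R sketch. `dispatch_path` is a hypothetical helper that only imitates the shape of a `switch` on the object's class; it is not the real `writeObject`.

```r
# Hypothetical helper: dispatch on the R class, as a type switch would.
dispatch_path <- function(object) {
  switch(class(object)[[1]],
         integer = "integer path",
         numeric = "double path",
         bigint  = "bigint path",   # no plain R value has class "bigint"
         "other path")
}

dispatch_path(1L)                 # "integer path"
dispatch_path(1.5)                # "double path"
dispatch_path(1380742793415240)   # "double path", never "bigint path"
```

Under this assumption, a `bigint` branch can only be reached if something other than the value's own class supplies the type name.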
[GitHub] spark pull request #17797: [SparkR][DOC]:Document LinearSVC in R programming...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/17797 [SparkR][DOC]:Document LinearSVC in R programming guide ## What changes were proposed in this pull request? Add a link to svmLinear in the SparkR programming document. ## How was this patch tested? Built the doc manually and clicked the link to the document. It looks good. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17797.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17797 commit 3a59cc2a1741a2dae6f20fa71e689a0dcc16c835 Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-04-28T05:07:46Z add link to linear svc
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113823246 --- Diff: R/pkg/R/serialize.R --- @@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) { Date = writeDate(con, object), POSIXlt = writeTime(con, object), POSIXct = writeTime(con, object), + bigint = writeDouble(con, object), --- End diff -- @felixcheung Any thoughts? Thanks!
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113586516 --- Diff: R/pkg/R/serialize.R --- @@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) { Date = writeDate(con, object), POSIXlt = writeTime(con, object), POSIXct = writeTime(con, object), + bigint = writeDouble(con, object), --- End diff -- If R doesn't have a `bigint` type, we should remove all `bigint`-related logic. I don't know the history of the `bigint` mapping in the Types.R file. Why should we have it, since every big number is numeric (Double in the backend)?
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113585851 --- Diff: R/pkg/R/serialize.R --- @@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) { Date = writeDate(con, object), POSIXlt = writeTime(con, object), POSIXct = writeTime(con, object), + bigint = writeDouble(con, object), --- End diff -- When specifying a schema with `bigint`, we will hit the bigint path. Without this change, it will throw a type mismatch error. But as you said, we can't specify the `bigint` type in the R console.
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113362108 --- Diff: R/pkg/inst/tests/testthat/test_Serde.R --- @@ -28,6 +28,10 @@ test_that("SerDe of primitive types", { expect_equal(x, 1) expect_equal(class(x), "numeric") + x <- callJStatic("SparkRHandler", "echo", 1380742793415240) --- End diff -- I did some searching. R can't specify a `bigint` type, so we can't directly test the `bigint` type. We can remove the tests above, as we added `schema` tests and Scala API tests.
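A short base-R check of why a value like the one in this test cannot be given an integer type from the R console, and why holding it as a double loses nothing here. This is a sketch of plain R behaviour, not of anything SparkR does.

```r
x <- 1380742793415240

# R's built-in integer type is 32-bit, so the value does not fit.
.Machine$integer.max                      # 2147483647
x > .Machine$integer.max                  # TRUE
is.na(suppressWarnings(as.integer(x)))    # TRUE: coercion yields NA

# As a double it is still exact, because it lies below 2^53.
x < 2^53                                  # TRUE
(x + 1) - x == 1                          # TRUE
```

Integers at or above 2^53 are where doubles stop being exact for every value, which is the separate precision issue the PR description points to.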
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113358460 --- Diff: R/pkg/inst/tests/testthat/test_Serde.R --- @@ -28,6 +28,10 @@ test_that("SerDe of primitive types", { expect_equal(x, 1) expect_equal(class(x), "numeric") + x <- callJStatic("SparkRHandler", "echo", 1380742793415240) --- End diff -- I don't know how to enforce the bigint type from the R console.
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17640#discussion_r113358355 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -3043,6 +3043,23 @@ test_that("catalog APIs, currentDatabase, setCurrentDatabase, listDatabases", { expect_equal(dbs[[1]], "default") }) +test_that("dapply with bigint type", { + df <- createDataFrame( +list(list(1380742793415240, 1, "1"), list(1380742793415240, 2, "2"), +list(1380742793415240, 3, "3")), c("a", "b", "c")) + schema <- structType(structField("a", "bigint"), structField("b", "bigint"), --- End diff -- This one tests bigint
[GitHub] spark pull request #17754: [FollowUp][SPARK-18901][ML]: Require in LR Logist...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/17754 [FollowUp][SPARK-18901][ML]: Require in LR LogisticAggregator is redundant ## What changes were proposed in this pull request? This is a follow-up PR of #17478. ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17754.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17754 commit dbff96111fd00c2127afe2a46515efc163aa36b8 Author: wangmiao1981 <wm...@hotmail.com> Date: 2017-04-25T00:11:08Z remove extra require check
[GitHub] spark issue #17478: [SPARK-18901][ML]:Require in LR LogisticAggregator is re...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17478 @yanboliang I will do it. Thanks!
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 @felixcheung I just came back from vacation. I will make changes now. Thanks!
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 I am adding more tests right now.
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 Based on my understanding, it does not directly solve SPARK-12360. This one just solves the serialization of a specific type, `bigint`, in a struct field.
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 For the `Inf` case, I used a very large number: the 16-digit sequence 1380742793415240 repeated back to back many times.
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 If I use a very big number, the sparkR shell prints the following output:

    > collect(df1)
        a b c   d
    1 Inf 1 1 Inf

So the overflow problem is already taken care of on the Scala side. We don't have to add additional handling on the R side.
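The `Inf` shown above matches how plain R treats a number beyond the range of a double. The base-R sketch below only illustrates that R-side behaviour; the 400-digit value is an arbitrary stand-in, not the exact number used in this thread, and the sketch says nothing about what the Scala side does.

```r
# Build a 400-digit decimal string by repeating the 16-digit test value.
digits <- strrep("1380742793415240", 25)
nchar(digits)                 # 400

# Converting it to a double overflows, so R reports Inf.
big <- as.numeric(digits)
is.infinite(big)              # TRUE

# The largest finite double, for comparison.
.Machine$double.xmax          # about 1.797693e+308
```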
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 cc @felixcheung
[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17640 I will add some bounds checking and error handling.
[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/17640 [SPARK-17608][SPARKR]:Long type has incorrect serialization/deserialization ## What changes were proposed in this pull request? `bigint` is not supported in the schema, and it is not serialized as `Double`. This PR adds `bigint` support to the schema and serializes and deserializes it as `Double`. This fix is orthogonal to the precision problem in https://issues.apache.org/jira/browse/SPARK-12360 ## How was this patch tested? Add a new unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark summary Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17640.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17640 commit 03b82ac19dcbe17a70d9e45790dd24210b6d4f07 Author: wm...@hotmail.com <wm...@hotmail.com> Date: 2017-04-14T17:43:35Z add bigint support
[GitHub] spark issue #17611: [SPARK-20298][SparkR][MINOR] fixed spelling mistake "cha...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17611 LGTM
[GitHub] spark issue #17611: [SPARK-20298][SparkR][MINOR] fixed spelling mistake "cha...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17611 Jenkins, test this please.
[GitHub] spark issue #17478: [SPARK-18901][ML]:Require in LR LogisticAggregator is re...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/17478 @sethah Thanks for your reply! Your suggestion makes sense to me. My intention was to close the JIRA with a simple fix. How about we add a test for these checks and close the original JIRA? Or do you think we should just mark that JIRA as Won't Fix? Thanks!