[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-19 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r182829257
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
 assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+for (lossType <- Seq("logistic")) {
--- End diff --

OK. It makes sense.


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-18 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r182603253
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
 assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+for (lossType <- Seq("logistic")) {
--- End diff --

There is only one lossType, so the `for` is not necessary.
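
A minimal sketch of the suggested simplification (`trainData` and `validationData` are assumed DataFrames provided by the suite, not names from this PR):

    test("model evaluateEachIteration") {
      // Only one loss type is supported for classification, so bind it
      // directly instead of looping over a one-element Seq.
      val lossType = "logistic"
      val gbt = new GBTClassifier()
        .setLossType(lossType)
        .setMaxIter(3)
      val model = gbt.fit(trainData)                            // assumed training DataFrame
      val losses = model.evaluateEachIteration(validationData)  // assumed validation DataFrame
      assert(losses.length === model.trees.length)
    }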


---




[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...

2018-04-17 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21090#discussion_r182254888
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,256 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types._
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasPredictionCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.4.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use a normalized sum of 
similarities with other vertices.
+   * Default: random.
+   * @group expertParam
+   */
+  @Since("2.4.0")
+  final val initMode = {
+val allowedParams = ParamValidators.inArray(Array("random", "degree"))
+new Param[String](this, "initMode", "The initialization algorithm. 
This can be either " +
+  "'random' to use a random vector as vertex properties, or 'degree' 
to use a normalized sum " +
+  "of similarities with other vertices.  Supported options: 'random' 
and 'degree'.",
+  allowedParams)
+  }
+
+  /** @group expertGetParam */
+  @Since("2.4.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the name of the input column for vertex IDs.
+   * Default: "id"
+   * @group param
+   */
+  @Since("2.4.0")
+  val idCol = new Param[String](this, "idCol", "Name of the input column 
for vertex IDs.",
+(value: String) => value.nonEmpty)
+
+  setDefault(idCol, "id")
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getIdCol: String = getOrDefault(idCol)
+
+  /**
+   * Param for the name of the input column for neighbors in the adjacency 
list representation.
+   * Default: "neighbors"
+   * @group param
+   */
+  @Since("2.4.0")
+  val neighborsCol = new Param[String](this, "neighborsCol",
+"Name of the input column for neighbors in the adjacency list 
representation.",
+(value: String) => value.nonEmpty)
+
+  setDefault(neighborsCol, "neighbors")
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getNeighborsCol: String = $(neighborsCol)
+
+  /**
+   * Param for the name of the input column for neighbors in the adjacency 
list representation.
+   * Default: "similarities"
+   * @group param
+   */
+  @Since("2.4.0")
+  val similaritiesCol = new Param[String](this, "similaritiesCol",
+"Name of the input column for neighbors in the adjacency list 
representation.",
+(value: String) => value.nonEmpty)
+
+  setDefault(similaritiesCol, "sim

[GitHub] spark issue #21090: [SPARK-15784][ML] Add Power Iteration Clustering to spar...

2018-04-17 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/21090
  
Took a quick look. Despite the style failure and a minor format issue, LGTM.


---




[GitHub] spark pull request #21090: [SPARK-15784][ML] Add Power Iteration Clustering ...

2018-04-17 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21090#discussion_r182243819
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/PowerIterationClusteringSuite.scala ---
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import scala.collection.mutable
+
+import org.apache.spark.ml.util.DefaultReadWriteTest
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
+import org.apache.spark.{SparkException, SparkFunSuite}
+
+
+class PowerIterationClusteringSuite extends SparkFunSuite
+  with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  @transient var data: Dataset[_] = _
+  final val r1 = 1.0
+  final val n1 = 10
+  final val r2 = 4.0
+  final val n2 = 40
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+data = PowerIterationClusteringSuite.generatePICData(spark, r1, r2, 
n1, n2)
+  }
+
+  test("default parameters") {
+val pic = new PowerIterationClustering()
+
+assert(pic.getK === 2)
+assert(pic.getMaxIter === 20)
+assert(pic.getInitMode === "random")
+assert(pic.getPredictionCol === "prediction")
+assert(pic.getIdCol === "id")
+assert(pic.getNeighborsCol === "neighbors")
+assert(pic.getSimilaritiesCol === "similarities")
+  }
+
+  test("parameter validation") {
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setK(1)
+}
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setInitMode("no_such_a_mode")
+}
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setIdCol("")
+}
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setNeighborsCol("")
+}
+intercept[IllegalArgumentException] {
+  new PowerIterationClustering().setSimilaritiesCol("")
+}
+  }
+
+  test("power iteration clustering") {
+val n = n1 + n2
+
+val model = new PowerIterationClustering()
+  .setK(2)
+  .setMaxIter(40)
+val result = model.transform(data)
+
+val predictions = Array.fill(2)(mutable.Set.empty[Long])
+result.select("id", "prediction").collect().foreach {
+  case Row(id: Long, cluster: Integer) => predictions(cluster) += id
+}
+assert(predictions.toSet == Set((1 until n1).toSet, (n1 until 
n).toSet))
+
+val result2 = new PowerIterationClustering()
+  .setK(2)
+  .setMaxIter(10)
+  .setInitMode("degree")
+  .transform(data)
+val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
+result2.select("id", "prediction").collect().foreach {
+  case Row(id: Long, cluster: Integer) => predictions2(cluster) += id
+}
+assert(predictions2.toSet == Set((1 until n1).toSet, (n1 until 
n).toSet))
+  }
+
+  test("supported input types") {
+val model = new PowerIterationClustering()
+  .setK(2)
+  .setMaxIter(1)
+
+def runTest(idType: DataType, neighborType: DataType, similarityType: 
DataType): Unit = {
+  val typedData = data.select(
+col("id").cast(idType).alias("id"),
+col("neighbors").cast(ArrayType(neighborType, containsNull = 
false)).alias("neighbors"),
+col("similarities").cast(ArrayType(similarityType, containsNull = 
false))
+  .alias("similarities")
+  )

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2018-04-17 Thread wangmiao1981
Github user wangmiao1981 closed the pull request at:

https://github.com/apache/spark/pull/15770


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2018-04-17 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@jkbradley I am closing this one now. Thanks!


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2018-04-17 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@jkbradley Sorry for missing your comments. Anyway, I will close it now. I 
will choose another one to work on. Thanks! 


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2018-01-03 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
ping @yanboliang 


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-11-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
ping @yanboliang 


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-11-09 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@weichenXu123 Any other comments? Thanks!


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-11-01 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@WeichenXu123 Thanks for your review and reply! I agree with you that the helper can be discussed later as a potential enhancement.


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-10-31 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@WeichenXu123, regarding the graph helper: MLlib has a version that takes `Graph[Double, Double]` as a parameter for training. In ML, do we have to provide a `Dataset`-based equivalent of the `Graph` input? Can you specify the requirement? I have addressed your other comments. Thanks!
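
For context, a minimal sketch of the two existing spark.mllib entry points (usage assumes a SparkContext `sc`; the triplet data is made up):

    import org.apache.spark.mllib.clustering.PowerIterationClustering
    import org.apache.spark.rdd.RDD

    // spark.mllib trains either from (srcId, dstId, similarity) triplets via
    // run(RDD[(Long, Long, Double)]) or from a Graph[Double, Double] via run(Graph);
    // the ML wrapper in this PR only feeds the triplet form.
    val similarities: RDD[(Long, Long, Double)] =
      sc.parallelize(Seq((0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1)))
    val model = new PowerIterationClustering()
      .setK(2)
      .setMaxIterations(10)
      .run(similarities)
    model.assignments.collect().foreach(a => println(s"${a.id} -> ${a.cluster}"))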


---




[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-10-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r143078744
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.3.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.3.0")
+  final val initMode = {
+val allowedParams = ParamValidators.inArray(Array("random", "degree"))
+new Param[String](this, "initMode", "The initialization algorithm. " +
+  "Supported options: 'random' and 'degree'.", allowedParams)
+  }
+
+  /** @group expertGetParam */
+  @Since("2.3.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
PowerIterationClustering.transform().
+   * Default: "id"
+   * @group param
+   */
+  @Since("2.3.0")
+  val idCol = new Param[String](this, "id", "column name for ids.")
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Param for the column name for neighbors required by 
PowerIterationClustering.transform().
+   * Default: "neighbor"
+   * @group param
+   */
+  @Since("2.3.0")
+  val neighborCol = new Param[String](this, "neighbor", "column name for 
neighbors.")
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getNeighborCol: String = $(neighborCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * <a href="http://www.icml2010.org/papers/387.pdf">Lin and Cohen</a>. From the abstract:
+ * PIC finds a very low-dimensional embedding of a dataset using truncated 
power
+ * iteration on a normalized pair-wise similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is 

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-10-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r143078479
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.3.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.3.0")
+  final val initMode = {
+val allowedParams = ParamValidators.inArray(Array("random", "degree"))
+new Param[String](this, "initMode", "The initialization algorithm. " +
+  "Supported options: 'random' and 'degree'.", allowedParams)
+  }
+
+  /** @group expertGetParam */
+  @Since("2.3.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
PowerIterationClustering.transform().
+   * Default: "id"
+   * @group param
+   */
+  @Since("2.3.0")
+  val idCol = new Param[String](this, "id", "column name for ids.")
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Param for the column name for neighbors required by 
PowerIterationClustering.transform().
+   * Default: "neighbor"
+   * @group param
+   */
+  @Since("2.3.0")
+  val neighborCol = new Param[String](this, "neighbor", "column name for 
neighbors.")
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getNeighborCol: String = $(neighborCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * <a href="http://www.icml2010.org/papers/387.pdf">Lin and Cohen</a>. From the abstract:
+ * PIC finds a very low-dimensional embedding of a dataset using truncated 
power
+ * iteration on a normalized pair-wise similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is 

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-09-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
I will address the review comments soon. Thanks! @WeichenXu123 


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-09-08 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
ping @WeichenXu123


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-09-08 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
ping @WeichenXu123


---




[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-08-19 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@WeichenXu123 I have made changes based on your comments. Thanks!


---



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-08-16 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
Jenkins, retest this please.


---



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-08-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
[info] Main Scala API documentation successful.
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] Total time: 95 s, completed Aug 15, 2017 4:59:59 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt 
-Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc ; 
received return code 1

This failure seems unrelated to my change.


---



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-08-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
retest please


---



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-08-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
Jenkins, retest please


---



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-08-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
Weird. The local style test passed. Anyway, I changed the order as required by Jenkins.


---



[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-08-15 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r133271527
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = {
+val allowedParams = ParamValidators.inArray(Array("random", "degree"))
+new Param[String](this, "initMode", "The initialization algorithm. " +
+  "Supported options: 'random' and 'degree'.", allowedParams)
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "id", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Param for the column name for neighbors required by 
[[PowerIterationClustering.transform()]].
+   * Default: "neighbor"
+   * @group param
+   */
+  val neighborCol = new Param[String](this, "neighbor", "column name for 
neighbors.")
+
+  /** @group getParam */
+  def getNeighborCol: String = $(neighborCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * <a href="http://www.icml2010.org/papers/387.pdf">Lin and Cohen</a>. From the abstract:
+ * PIC finds a very low-dimensional embedding of a dataset using truncated 
power
+ * iteration on a normalized pair-wise similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see http:/

[GitHub] spark pull request #15770: [SPARK-15784][ML]:Add Power Iteration Clustering ...

2017-08-15 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15770#discussion_r133267575
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.clustering.{PowerIterationClustering => 
MLlibPowerIterationClustering}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{IntegerType, LongType, StructField, 
StructType}
+
+/**
+ * Common params for PowerIterationClustering
+ */
+private[clustering] trait PowerIterationClusteringParams extends Params 
with HasMaxIter
+  with HasFeaturesCol with HasPredictionCol with HasWeightCol {
+
+  /**
+   * The number of clusters to create (k). Must be > 1. Default: 2.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
+"Must be > 1.", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getK: Int = $(k)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to use a random vector
+   * as vertex properties, or "degree" to use normalized sum similarities. 
Default: random.
+   */
+  @Since("2.2.0")
+  final val initMode = {
+val allowedParams = ParamValidators.inArray(Array("random", "degree"))
+new Param[String](this, "initMode", "The initialization algorithm. " +
+  "Supported options: 'random' and 'degree'.", allowedParams)
+  }
+
+  /** @group expertGetParam */
+  @Since("2.2.0")
+  def getInitMode: String = $(initMode)
+
+  /**
+   * Param for the column name for ids returned by 
[[PowerIterationClustering.transform()]].
+   * Default: "id"
+   * @group param
+   */
+  val idCol = new Param[String](this, "id", "column name for ids.")
+
+  /** @group getParam */
+  def getIdCol: String = $(idCol)
+
+  /**
+   * Param for the column name for neighbors required by 
[[PowerIterationClustering.transform()]].
+   * Default: "neighbor"
+   * @group param
+   */
+  val neighborCol = new Param[String](this, "neighbor", "column name for 
neighbors.")
+
+  /** @group getParam */
+  def getNeighborCol: String = $(neighborCol)
+
+  /**
+   * Validates the input schema
+   * @param schema input schema
+   */
+  protected def validateSchema(schema: StructType): Unit = {
+SchemaUtils.checkColumnType(schema, $(idCol), LongType)
+SchemaUtils.checkColumnType(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Power Iteration Clustering (PIC), a scalable graph clustering algorithm 
developed by
+ * <a href="http://www.icml2010.org/papers/387.pdf">Lin and Cohen</a>. From the abstract:
+ * PIC finds a very low-dimensional embedding of a dataset using truncated 
power
+ * iteration on a normalized pair-wise similarity matrix of the data.
+ *
+ * Note that we implement [[PowerIterationClustering]] as a transformer. 
The [[transform]] is an
+ * expensive operation, because it uses PIC algorithm to cluster the whole 
input dataset.
+ *
+ * @see http:/

[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-08-10 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@WeichenXu123 Thanks for reviewing! I will address the comments soon.



---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-24 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
@felixcheung Can you take a look? Thanks!




---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-20 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
@yanboliang I have made changes accordingly. Thanks!


---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-18 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
@yanboliang Thanks for your reply! I will change the unit tests now. 


---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-17 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
@yanboliang After #18613, unit tests fail if "skip" is used.

For example:

    data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
                       someString = base::sample(c("this", "that"), 10, replace = TRUE),
                       stringsAsFactors = FALSE)
    trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
    traindf <- as.DataFrame(data[trainidxs, ])
    testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
    model <- spark.mlp(traindf, clicked ~ ., layers = c(1, 3), handleInvalid = "skip")
    predictions <- predict(model, testdf)
    expect_equal(class(collect(predictions)$clicked[1]), "character")

It fails the same as if "error" were used.

If I change "skip" to "keep", then collect(predictions)$clicked[1] is NULL:

    > collect(predictions)
    [1] clicked    someString prediction
    <0 rows> (or 0-length row.names)
    > collect(predictions)$clicked[1]
    [[1]]
    NULL

I am not sure whether this is expected or there is a bug.

Before #18613, these unit tests worked fine.
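
For comparison, a minimal Scala sketch of the option under discussion (made-up two-row data, assuming a SparkSession `spark`; RFormula gained handleInvalid in SPARK-20307):

    import org.apache.spark.ml.feature.RFormula
    import spark.implicits._

    val train = Seq((0.0, "this"), (1.0, "that")).toDF("clicked", "someString")
    val test  = Seq((0.0, "the other")).toDF("clicked", "someString")

    val formula = new RFormula()
      .setFormula("clicked ~ .")
      .setHandleInvalid("keep")  // "keep" buckets unseen levels; "skip" drops those rows
    formula.fit(train).transform(test).show()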





---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
Sure. I am reading the #18613 comments. Just came back from a business trip. Thanks!


---



[GitHub] spark issue #18613: [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should han...

2017-07-12 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18613
  
@felixcheung I agree. We should make the changes on the Scala side.


---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-12 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
Trigger Windows check.


---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-12 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
Reopen for the Windows check.


---



[GitHub] spark pull request #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleIn...

2017-07-12 Thread wangmiao1981
GitHub user wangmiao1981 reopened a pull request:

https://github.com/apache/spark/pull/18605

[SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid for classification 
algorithms

## What changes were proposed in this pull request?

SPARK-20307 Added handleInvalid option to RFormula for tree-based 
classification algorithms. We should add this parameter for other 
classification algorithms in SparkR.

This is a followup PR for SPARK-20307.

## How was this patch tested?

New Unit tests are added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark class

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18605






---



[GitHub] spark pull request #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleIn...

2017-07-12 Thread wangmiao1981
Github user wangmiao1981 closed the pull request at:

https://github.com/apache/spark/pull/18605


---



[GitHub] spark issue #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid f...

2017-07-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18605
  
@felixcheung This is a follow-up PR for SPARK-20307.


---



[GitHub] spark pull request #18605: [SparkR][SPARK-21381]:SparkR: pass on setHandleIn...

2017-07-11 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/18605

[SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid for classification 
algorithms

## What changes were proposed in this pull request?

SPARK-20307 Added handleInvalid option to RFormula for tree-based 
classification algorithms. We should add this parameter for other 
classification algorithms in SparkR.

This is a followup PR for SPARK-20307.

## How was this patch tested?

New Unit tests are added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark class

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18605


commit 77b04a37e93d6967def24c0a8265ed784875f5b0
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-07-12T00:40:58Z

add handleInvalid for classifications




---



[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-07-08 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18496
  
#14850 is the PR that prints the full stack. We can improve it by printing the cause instead of the full stack trace.
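
A sketch of the idea (hypothetical helper, not the actual SparkR utility):

    // Walk the exception chain and report only the root cause's message
    // instead of dumping the whole stack trace.
    @annotation.tailrec
    def rootCause(t: Throwable): Throwable =
      if (t.getCause == null || (t.getCause eq t)) t else rootCause(t.getCause)

    // e.g. log rootCause(e).getMessage rather than the full e.printStackTrace()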


---



[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-07-08 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18496
  
I will review all classifiers and add handleInvalid where necessary.


---



[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-07-08 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18496
  
Actually, the udf in transform() of StringIndexer.scala will throw an exception during an action, but it doesn't stop the execution of collect().

    val indexer = udf { label: String =>
      if (label == null) {
        if (keepInvalid) {
          labels.length
        } else {
          throw new SparkException("StringIndexer encountered NULL value. To handle or skip " +
            "NULLS, try setting StringIndexer.handleInvalid.")
        }
      } else {
        if (labelToIndex.contains(label)) {
          labelToIndex(label)
        } else if (keepInvalid) {
          labels.length
        } else {
          throw new SparkException(s"Unseen label: $label.  To handle unseen labels, " +
            s"set Param handleInvalid to ${StringIndexer.KEEP_INVALID}.")  // <=== this is the exception
        }
      }
    }

I am asking people who are more familiar with this logic why it doesn't stop the collect().
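
For context, the usual lazy-evaluation behavior looks like this (a toy sketch with made-up data, assuming a SparkSession `spark`): the udf body only runs when an action materializes the plan, so the exception normally surfaces at collect() time.

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    val failing = udf { label: String =>
      if (label == "unseen") throw new RuntimeException(s"Unseen label: $label")
      else label
    }
    // Building the plan raises nothing...
    val df = Seq("a", "unseen").toDF("label").withColumn("out", failing($"label"))
    // ...the SparkException (wrapping the RuntimeException) is only thrown here:
    // df.collect()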


---



[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-07-07 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18496
  
I did a quick debug. In Dataset.scala:

    def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
      val qe = sparkSession.sessionState.executePlan(logicalPlan)  // <=== this line throws:
      // Method threw 'org.apache.spark.SparkException' exception. Cannot evaluate
      // org.apache.spark.sql.execution.QueryExecution.toString()




---



[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-07-07 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18496
  
@felixcheung Yes. I think we can improve the Scala side. It only throws an exception when a `NULL` field is given. For unseen labels, as in the example above, it always fails at the same place, the `(string) => double` udf. The Scala side doesn't catch this exception; it lets it propagate into the handling logic, which causes the failure. I will try to address it in a follow-up PR.


---



[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...

2017-07-07 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15770
  
@yanboliang Can you take a look first? Thanks!


---



[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-07 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r126198035
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+someString = base::sample(c("this", "that"), 10, 
replace = TRUE),
+stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = 
"classification",
+  maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

On the Scala side, I created a case where an unseen label is used in the test data:

    val data: Seq[(Int, String)] = Seq((0, "a"), (1, "b"), (2, "b"), (3, null))
    val data2: Seq[(Int, String)] = Seq((0, "a"), (1, "b"), (3, "d"))
    val df = data.toDF("id", "label")
    val df2 = data2.toDF("id", "label")

    val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")

    indexer.setHandleInvalid("error")
    indexer.fit(df).transform(df2).collect()

It also fails with the same error message as the R case. I think this is the expected behavior for `"error"`.

Failed message:

    Failed to execute user defined function($anonfun$9: (string) => double)
    org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
        at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
        at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:139)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-06 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125954907
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+someString = base::sample(c("this", "that"), 10, replace = TRUE),
+stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+  maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

Let me check how the "error" option is handled. It seems that no exception is 
thrown out.





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125802201
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+someString = base::sample(c("this", "that"), 10, replace = TRUE),
+stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+  maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

The console prints out:

Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)

Shall I match this?
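
A hedged aside: rather than matching the full stage/task text, which varies 
from run to run, the assertion could match only the stable part of the 
message. A minimal sketch, reusing the `predictions` object from the test above:

    expect_error(collect(predictions),
                 "Failed to execute user defined function")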





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125703030
  
--- Diff: R/pkg/tests/fulltests/test_mllib_tree.R ---
@@ -212,6 +212,23 @@ test_that("spark.randomForest", {
   expect_equal(length(grep("1.0", predictions)), 50)
   expect_equal(length(grep("2.0", predictions)), 50)
 
+  # Test unseen labels
+  data <- data.frame(clicked = base::sample(c(0, 1), 10, replace = TRUE),
+someString = base::sample(c("this", "that"), 10, replace = TRUE),
+stringsAsFactors = FALSE)
+  trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
+  traindf <- as.DataFrame(data[trainidxs, ])
+  testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
+  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
+  maxDepth = 10, maxBins = 10, numTrees = 10)
+  predictions <- predict(model, testdf)
+  expect_error(collect(predictions))
--- End diff --

The training call has no error because the training data contains no unseen 
labels.

I think the internals do have logic for handling unseen labels, but when the 
collect action runs, the internal value cannot be mapped back for the unseen 
label. That is why it only fails at collection time.

I will add the error string.
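
A minimal sketch of the laziness point above, assuming a running SparkR 
session and the `model`/`testdf` objects from the quoted test:

    predictions <- predict(model, testdf)  # no error yet: this only builds a lazy plan
    collect(predictions)                   # the UDF runs here, so the unseen label
                                           # is what triggers the SparkException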





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-07-05 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125702340
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #' can speed up training of deeper trees. Users can set how often should the
 #' cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
+#'Supported options: "skip" (filter out rows with invalid data),
+#'   "error" (throw an error), "keep" (put invalid data in a special additional
--- End diff --

Yes, `error` is the default behavior; the backend code sets it via 
`setDefault`. I will reorder the options and add that text to the document.
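
For illustration, a hedged sketch of passing a non-default option through the 
new argument added in this PR (reusing `traindf` from the quoted test):

    model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                                handleInvalid = "skip")
    # with "skip", rows carrying unseen string levels are filtered out at
    # predict time instead of failing the job as "error" does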





[GitHub] spark pull request #18518: [MINOR][SparkR]: ignore Rplots.pdf test output af...

2017-07-03 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/18518

[MINOR][SparkR]: ignore Rplots.pdf test output after running R tests

## What changes were proposed in this pull request?

After running the R tests in a local build, an Rplots.pdf file is left behind. 
This file should be ignored in the git repository.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark ignore

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18518


commit 3abca0488da7496cad6038321aac24d1a910670e
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-07-03T21:57:16Z

ignore one test output file







[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-07-01 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18496
  
I will fix it tonight. It is weird: my local test passed, so it seems my new 
change wasn't applied in the test run. Anyway, I will fix the failure first.





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154756
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -409,7 +413,7 @@ setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "fo
 maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL,
 featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0,
 minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10,
-   maxMemoryInMB = 256, cacheNodeIds = FALSE) {
+   maxMemoryInMB = 256, cacheNodeIds = FALSE, handleInvalid = "error") {
--- End diff --

Let me check how to use match.arg().
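
For reference, a small self-contained sketch of the match.arg() idiom; the 
function name is made up for illustration:

    fitModel <- function(handleInvalid = c("error", "skip", "keep")) {
      # match.arg() returns the first candidate as the default and rejects
      # anything outside the candidate set with a descriptive error
      handleInvalid <- match.arg(handleInvalid)
      handleInvalid
    }
    fitModel()        # returns "error"
    fitModel("skip")  # returns "skip"
    fitModel("drop")  # error: 'arg' should be one of "error", "skip", "keep"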





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154735
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #' can speed up training of deeper trees. Users can set how often should the
 #' cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
--- End diff --

I think `labels` here means the string levels of a feature, which is 
categorical (e.g., `white`, `black`).
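
A tiny plain-R illustration of the distinction, with hypothetical data: the 
problem is an unseen level of a categorical feature column, not an unseen 
class label.

    train <- data.frame(clicked = c(0, 1), color = c("white", "black"))
    test  <- data.frame(clicked = 0,      color = "red")  # "red" never seen in training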





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
@jiangxb1987 The original PR has some issues that are not correctly 
handled. I will open a new PR when I figure out the right fix. I intended to 
close this PR. Thanks for closing it.





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/18496

[SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid to spark.mllib 
functions that use StringIndexer

## What changes were proposed in this pull request?

For the randomForest classifier, if the test data contains unseen labels, it 
throws an error. The StringIndexer already has the handleInvalid logic; this 
patch adds a new method to set the underlying StringIndexer's handleInvalid 
behavior.

This patch should also apply to other classifiers. This PR focuses on the 
main logic and the randomForest classifier; I will do follow-up PRs for the 
other classifiers.

## How was this patch tested?

Add a new unit test based on the error case in the JIRA.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark handle

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18496


commit a2cdf511f6ad346efcb81d51f3b805a34063fa0f
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-07-01T04:00:27Z

handle unseen labels







[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...

2017-06-13 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18128
  
ping @yanboliang 





[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...

2017-06-03 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18128
  
@felixcheung if I remove `as.integer`, the backend doesn't recognize the value 
as an `integer`.





[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...

2017-06-03 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18128
  
Local test passed. Let me check it tonight.





[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...

2017-06-03 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18128
  
Jenkins retest this please





[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...

2017-06-02 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18128#discussion_r119978881
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -239,21 +253,64 @@ function(object, path, overwrite = FALSE) {
 setMethod("spark.logit", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, regParam = 0.0, elasticNetParam = 0.0, 
maxIter = 100,
tol = 1E-6, family = "auto", standardization = TRUE,
-   thresholds = 0.5, weightCol = NULL, aggregationDepth = 
2) {
+   thresholds = 0.5, weightCol = NULL, aggregationDepth = 
2,
+   lowerBoundsOnCoefficients = NULL, 
upperBoundsOnCoefficients = NULL,
+   lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts 
= NULL) {
 formula <- paste(deparse(formula), collapse = "")
+row <- 0
+col <- 0
 
 if (!is.null(weightCol) && weightCol == "") {
   weightCol <- NULL
 } else if (!is.null(weightCol)) {
   weightCol <- as.character(weightCol)
 }
 
+if (!is.null(lowerBoundsOnIntercepts)) {
+lowerBoundsOnIntercepts <- as.array(lowerBoundsOnIntercepts)
+}
+
+if (!is.null(upperBoundsOnIntercepts)) {
+upperBoundsOnIntercepts <- as.array(upperBoundsOnIntercepts)
+}
+
+if (!is.null(lowerBoundsOnCoefficients)) {
+  if (class(lowerBoundsOnCoefficients) != "matrix") {
+stop("lowerBoundsOnCoefficients must be a matrix.")
+  }
+  row <- nrow(lowerBoundsOnCoefficients)
+  col <- ncol(lowerBoundsOnCoefficients)
+  lowerBoundsOnCoefficients <- as.array(as.vector(lowerBoundsOnCoefficients))
+}
+
+if (!is.null(upperBoundsOnCoefficients)) {
+  if (class(upperBoundsOnCoefficients) != "matrix") {
+stop("upperBoundsOnCoefficients must be a matrix.")
+  }
+
+  if (!is.null(lowerBoundsOnCoefficients) & (row != nrow(upperBoundsOnCoefficients)
+| col != ncol(upperBoundsOnCoefficients))) {
+stop(paste("dimension of upperBoundsOnCoefficients ",
+   "is not the same as lowerBoundsOnCoefficients", sep = ""))
+  }
+
+  if (is.null(lowerBoundsOnCoefficients)) {
+row <- nrow(upperBoundsOnCoefficients)
+col <- ncol(upperBoundsOnCoefficients)
+  }
--- End diff --

This is the case where we only set the upper bound; we can set both bounds or 
either one of them.

For the case where both are set, we enforce that the upper and lower bounds 
have the same dimensions, as checked above.
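
A minimal sketch of the upper-bound-only case, assuming `df` is a 
SparkDataFrame with a binary label and four features (hypothetical):

    ub <- matrix(rep(1.0, 4), nrow = 1, ncol = 4)  # 1 x numFeatures for binomial
    model <- spark.logit(df, label ~ ., upperBoundsOnCoefficients = ub,
                         upperBoundsOnIntercepts = 1.0)
    # supplying lowerBoundsOnCoefficients as well would require the same 1 x 4 shape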





[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...

2017-06-02 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18128#discussion_r119911447
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -239,21 +253,57 @@ function(object, path, overwrite = FALSE) {
 setMethod("spark.logit", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, regParam = 0.0, elasticNetParam = 0.0, 
maxIter = 100,
tol = 1E-6, family = "auto", standardization = TRUE,
-   thresholds = 0.5, weightCol = NULL, aggregationDepth = 
2) {
+   thresholds = 0.5, weightCol = NULL, aggregationDepth = 
2,
+   lowerBoundsOnCoefficients = NULL, 
upperBoundsOnCoefficients = NULL,
+   lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts 
= NULL) {
 formula <- paste(deparse(formula), collapse = "")
+lrow <- 0
+lcol <- 0
+urow <- 0
+ucol <- 0
--- End diff --

Oh, I think I can do the check because I have a `NULL` check before 
enforcing the rule.






[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...

2017-06-02 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18128#discussion_r119911006
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -239,21 +253,57 @@ function(object, path, overwrite = FALSE) {
 setMethod("spark.logit", signature(data = "SparkDataFrame", formula = 
"formula"),
   function(data, formula, regParam = 0.0, elasticNetParam = 0.0, 
maxIter = 100,
tol = 1E-6, family = "auto", standardization = TRUE,
-   thresholds = 0.5, weightCol = NULL, aggregationDepth = 
2) {
+   thresholds = 0.5, weightCol = NULL, aggregationDepth = 
2,
+   lowerBoundsOnCoefficients = NULL, 
upperBoundsOnCoefficients = NULL,
+   lowerBoundsOnIntercepts = NULL, upperBoundsOnIntercepts 
= NULL) {
 formula <- paste(deparse(formula), collapse = "")
+lrow <- 0
+lcol <- 0
+urow <- 0
+ucol <- 0
--- End diff --

Question: based on my understanding, `lowerBoundsOnCoefficients` and 
`upperBoundsOnCoefficients` are not required to be set at the same time, 
although they can be.
In the first case, we can't enforce that the two matrices have the same 
dimensions, because one of them could be `NULL`.
In the second case, we can check it.

So we can't enforce the rule in general.







[GitHub] spark issue #18128: [SPARK-20906][SparkR]:Constrained Logistic Regression fo...

2017-05-31 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/18128
  
@yanboliang Can you take a look? Thanks!





[GitHub] spark pull request #18128: [SPARK-20906][SparkR]:Constrained Logistic Regres...

2017-05-27 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/18128

[SPARK-20906][SparkR]:Constrained Logistic Regression for SparkR

## What changes were proposed in this pull request?

PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
Regression for ML. We should add it to SparkR.

## How was this patch tested?

Add new unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18128.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18128


commit 1fc68f69ecce46c8d4c2bbd2d9aafdd042c27108
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-05-27T06:27:04Z

add constraint logit

commit 7627ac9c093ba72afd586c3ea1e482238d29c3c3
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-05-27T07:29:25Z

add unit test and doc







[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116345383
  
--- Diff: R/pkg/DESCRIPTION ---
@@ -42,6 +42,7 @@ Collate:
 'functions.R'
 'install.R'
 'jvm.R'
+'mllib_wrapper.R'
--- End diff --

Can you put these in lexicographic order?
 





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116345166
  
--- Diff: R/pkg/R/mllib_regression.R ---
@@ -360,6 +338,7 @@ setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = "formula"
 
 #  Get the summary of an IsotonicRegressionModel model
 
+#' @param object a fitted IsotonicRegressionModel.
--- End diff --

You used a capital 'A' below.





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116345323
  
--- Diff: R/pkg/R/mllib_wrapper.R ---
@@ -0,0 +1,61 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' S4 class that represents a Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
+#' @export
+#' @note JavaModel since 2.3.0
+setClass("JavaModel", representation(jobj = "jobj"))
+
+#' Makes predictions from a Java ML model
+#'
+#' @param object a Spark ML model.
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted value.
+#' @rdname spark.predict
+#' @aliases predict,JavaModel-method
+#' @export
+#' @note predict since 2.3.0
+setMethod("predict", signature(object = "JavaModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' S4 class that represents a writable Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
+#' @export
+#' @note JavaMLWritable since 2.3.0
+setClass("JavaMLWritable", representation(jobj = "jobj"))
+
+#  Save the ML model to the output path.
+
+#' @param object A fitted ML model.
+#' @param path The directory where the model is saved.
+#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE
--- End diff --

`O` -> `o` ? 





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116345209
  
--- Diff: R/pkg/R/mllib_wrapper.R ---
@@ -0,0 +1,61 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' S4 class that represents a Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
--- End diff --

`backing` -> `backend`?





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116345283
  
--- Diff: R/pkg/R/mllib_wrapper.R ---
@@ -0,0 +1,61 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' S4 class that represents a Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
+#' @export
+#' @note JavaModel since 2.3.0
+setClass("JavaModel", representation(jobj = "jobj"))
+
+#' Makes predictions from a Java ML model
+#'
+#' @param object a Spark ML model.
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted value.
+#' @rdname spark.predict
+#' @aliases predict,JavaModel-method
+#' @export
+#' @note predict since 2.3.0
+setMethod("predict", signature(object = "JavaModel"),
+  function(object, newData) {
+predict_internal(object, newData)
+  })
+
+#' S4 class that represents a writable Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
+#' @export
+#' @note JavaMLWritable since 2.3.0
+setClass("JavaMLWritable", representation(jobj = "jobj"))
+
+#  Save the ML model to the output path.
+
+#' @param object A fitted ML model.
--- End diff --

`A` -> `a` ?





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116344992
  
--- Diff: R/pkg/R/mllib_classification.R ---
@@ -22,29 +22,36 @@
 #'
 #' @param jobj a Java object reference to the backing Scala LinearSVCModel
 #' @export
+#' @include mllib_wrapper.R
 #' @note LinearSVCModel since 2.2.0
-setClass("LinearSVCModel", representation(jobj = "jobj"))
+setClass("LinearSVCModel", representation(jobj = "jobj"),
+ contains = c("JavaModel", "JavaMLWritable"))
 
 #' S4 class that represents an LogisticRegressionModel
 #'
 #' @param jobj a Java object reference to the backing Scala 
LogisticRegressionModel
 #' @export
 #' @note LogisticRegressionModel since 2.1.0
--- End diff --

Missing `#' @include mllib_wrapper.R`?





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116344933
  
--- Diff: R/pkg/R/generics.R ---
@@ -1535,9 +1535,7 @@ setGeneric("spark.freqItemsets", function(object) { standardGeneric("spark.freqI
 #' @export
 setGeneric("spark.associationRules", function(object) { standardGeneric("spark.associationRules") })
 
-#' @param object a fitted ML model object.
--- End diff --

Why remove these three lines?





[GitHub] spark issue #17808: [SPARK-20533][SparkR]:SparkR Wrappers Model should be pr...

2017-04-29 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17808
  
I think we don't have to back-port. This is a small 
improvement/optimization of the original code. 





[GitHub] spark pull request #17808: [SPARK-20533][SparkR]:SparkR Wrappers Model shoul...

2017-04-29 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/17808

[SPARK-20533][SparkR]:SparkR Wrappers Model should be private and value 
should be lazy

## What changes were proposed in this pull request?

The MultilayerPerceptronClassifierWrapper model should be private, and 
rFeatures and rCoefficients in LogisticRegressionWrapper.scala should be lazy.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark lazy

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17808.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17808


commit c1eaca911bf4aa4315929eda6ea6e7f6ceff04f4
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-04-29T16:49:14Z

change private and lazy







[GitHub] spark issue #17805: [SPARK-20477][SparkR][DOC]: Document R bisecting k-means...

2017-04-29 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17805
  
cc @felixcheung This is a similar documentation change.





[GitHub] spark pull request #17805: [SparkR][DOC][SPARK-20477]: Document R bisecting ...

2017-04-28 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/17805

[SparkR][DOC][SPARK-20477]: Document R bisecting k-means in R programming 
guide

## What changes were proposed in this pull request?

Add a hyperlink in the SparkR programming guide.

## How was this patch tested?

Build doc and manually check the doc link.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17805


commit 540bf7a34dcb7db0892e3cadf24b0c01364162f2
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-04-28T17:02:04Z

add spark.bisectingKmeans doc in the programming guide







[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-28 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113974703
  
--- Diff: R/pkg/R/serialize.R ---
@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
  Date = writeDate(con, object),
  POSIXlt = writeTime(con, object),
  POSIXct = writeTime(con, object),
+ bigint = writeDouble(con, object),
--- End diff --

For completeness, I think we can keep the write logic on the R side.





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-28 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113972686
  
--- Diff: R/pkg/R/serialize.R ---
@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
  Date = writeDate(con, object),
  POSIXlt = writeTime(con, object),
  POSIXct = writeTime(con, object),
+ bigint = writeDouble(con, object),
--- End diff --

When using createDataFrame, R uses `serialize` to send data to the backend. 
When taking an action, say `collect`, the Scala-side logic refers to the 
schema field and calls `readTypedObjects`, where the newly added read logic 
kicks in. On the way back to the R side, the newly added write logic kicks in, 
and the R side can interpret the result thanks to the R-side read logic. It 
seems the `write` logic on the R side is never called, because we don't have a 
specific `bigint` type in R. Right?
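
A minimal sketch of that round trip, assuming a running SparkR session; the 
bigint field in the schema is what exercises the newly added read path on 
collect():

    ldf <- data.frame(a = 1380742793415240)
    schema <- structType(structField("a", "bigint"))
    df <- createDataFrame(ldf, schema = schema)
    collect(df)$a  # goes through the backend and comes back as an R numeric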





[GitHub] spark issue #17797: [SparkR][DOC]:Document LinearSVC in R programming guide

2017-04-27 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17797
  
@felixcheung  As I checked the SparkR programming guide, it seems that all the 
machine learning parts are links to existing documents. So I just added the 
link to the Linear SVM document and tested it. Thanks!





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-27 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113853483
  
--- Diff: R/pkg/R/serialize.R ---
@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
  Date = writeDate(con, object),
  POSIXlt = writeTime(con, object),
  POSIXct = writeTime(con, object),
+ bigint = writeDouble(con, object),
--- End diff --

I see. But as you mentioned, we don't know how to trigger the write path on 
the R side, because both bigint and double are `numeric`. I think we can just 
remove the test on the R side.
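
A quick base-R illustration of why the write path can't be reached:

    x <- 1380742793415240
    class(x)      # "numeric" -- base R has no separate 64-bit integer class
    is.double(x)  # TRUE: the value is stored as a double either way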





[GitHub] spark pull request #17797: [SparkR][DOC]:Document LinearSVC in R programming...

2017-04-27 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/17797

[SparkR][DOC]:Document LinearSVC in R programming guide

## What changes were proposed in this pull request?

Add a link to svmLinear in the SparkR programming document.

## How was this patch tested?

Build doc manually and click the link to the document. It looks good.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17797.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17797


commit 3a59cc2a1741a2dae6f20fa71e689a0dcc16c835
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-04-28T05:07:46Z

add link to linear svc







[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-27 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113823246
  
--- Diff: R/pkg/R/serialize.R ---
@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
  Date = writeDate(con, object),
  POSIXlt = writeTime(con, object),
  POSIXct = writeTime(con, object),
+ bigint = writeDouble(con, object),
--- End diff --

@felixcheung Any thoughts? Thanks!





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-26 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113586516
  
--- Diff: R/pkg/R/serialize.R ---
@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
  Date = writeDate(con, object),
  POSIXlt = writeTime(con, object),
  POSIXct = writeTime(con, object),
+ bigint = writeDouble(con, object),
--- End diff --

If R doesn't have a `bigint` type, we should remove all `bigint`-related 
logic. I don't know the history of the `bigint` mapping in the Types.R file. 
Why should we have it, since every big number is numeric (a Double in the 
backend)?





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-26 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113585851
  
--- Diff: R/pkg/R/serialize.R ---
@@ -83,6 +83,7 @@ writeObject <- function(con, object, writeType = TRUE) {
  Date = writeDate(con, object),
  POSIXlt = writeTime(con, object),
  POSIXct = writeTime(con, object),
+ bigint = writeDouble(con, object),
--- End diff --

When specifying a schema with `bigint`, we will hit the bigint path. Without 
this change, it would throw a type-mismatch error. But as you said, we can't 
specify the `bigint` type in the R console.





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-25 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113362108
  
--- Diff: R/pkg/inst/tests/testthat/test_Serde.R ---
@@ -28,6 +28,10 @@ test_that("SerDe of primitive types", {
   expect_equal(x, 1)
   expect_equal(class(x), "numeric")
 
+  x <- callJStatic("SparkRHandler", "echo", 1380742793415240)
--- End diff --

I did some Google searching. R can't specify a `bigint` type, so we can't 
test `bigint` directly.

We can remove the tests above, as we have added `schema` tests and Scala API 
tests.





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-25 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113358460
  
--- Diff: R/pkg/inst/tests/testthat/test_Serde.R ---
@@ -28,6 +28,10 @@ test_that("SerDe of primitive types", {
   expect_equal(x, 1)
   expect_equal(class(x), "numeric")
 
+  x <- callJStatic("SparkRHandler", "echo", 1380742793415240)
--- End diff --

I don't know how to enforce the bigint type from the R console.





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-25 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17640#discussion_r113358355
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -3043,6 +3043,23 @@ test_that("catalog APIs, currentDatabase, setCurrentDatabase, listDatabases", {
   expect_equal(dbs[[1]], "default")
 })
 
+test_that("dapply with bigint type", {
+  df <- createDataFrame(
+list(list(1380742793415240, 1, "1"), list(1380742793415240, 2, "2"),
+list(1380742793415240, 3, "3")), c("a", "b", "c"))
+  schema <- structType(structField("a", "bigint"), structField("b", "bigint"),
--- End diff --

This one tests bigint





[GitHub] spark pull request #17754: [FollowUp][SPARK-18901][ML]: Require in LR Logist...

2017-04-24 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/17754

[FollowUp][SPARK-18901][ML]: Require in LR LogisticAggregator is redundant

## What changes were proposed in this pull request?

This is a follow-up PR of #17478. 

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark followup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17754.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17754


commit dbff96111fd00c2127afe2a46515efc163aa36b8
Author: wangmiao1981 <wm...@hotmail.com>
Date:   2017-04-25T00:11:08Z

remove extra require check







[GitHub] spark issue #17478: [SPARK-18901][ML]:Require in LR LogisticAggregator is re...

2017-04-24 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17478
  
@yanboliang I will do it. Thanks!





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-04-23 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
@felixcheung I just came back from vacation. I will make changes now. 
Thanks!





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-04-17 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
I am adding more tests right now.





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-04-16 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
Based on my understanding, it does not directly solve SPARK-12360. This one just fixes the serialization of the specific type `bigint` in a struct field.
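
To illustrate the distinction (my own example, not from the PR): SPARK-12360 is about precision. A double carries a 53-bit mantissa, so it loses integer precision above 2^53 no matter how `bigint` is serialized:

```r
# Doubles represent integers exactly only up to 2^53; the increment
# below is silently rounded away.
x <- 2^53
identical(x + 1, x)  # TRUE: precision loss, independent of serialization
```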





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-04-16 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
For the `Inf` case, I used a very large number:


138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240138074279341524013807427934152401380742793415240
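
(As a sanity check, not the test itself: this many digits is far beyond the double range of about 1.8e308, so parsing such a string overflows to Inf. A quick base-R sketch with a repeated digit block of similar magnitude:)

```r
# ~1600 digits is far beyond the double range (~1.8e308), so the
# parsed value overflows to Inf.
big <- strrep("1380742793415240", 100)
as.numeric(big)  # Inf
```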





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-04-15 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
If I use a very big number, the SparkR shell gives the following output:

> collect(df1)
     a b c   d
1  Inf 1 1 Inf

So the overflow problem has already been taken care of on the Scala side; we don't have to add additional handling on the R side.





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-04-14 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
cc @felixcheung 





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-04-14 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
I will add some bounds checks and error handling.
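
For illustration, a minimal sketch of what such a bounds check could look like on the R side; `checkBigintBounds` is a hypothetical helper name, not code from this PR:

```r
# Hypothetical bounds check: a double represents integers exactly only
# up to 2^53, so warn before a bigint value silently loses precision.
checkBigintBounds <- function(x) {
  if (abs(x) >= 2^53) {
    warning("bigint value exceeds 2^53 and may lose precision as a double")
  }
  as.numeric(x)
}

checkBigintBounds(1380742793415240)  # fine: well below 2^53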





[GitHub] spark pull request #17640: [SPARK-17608][SPARKR]:Long type has incorrect ser...

2017-04-14 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/17640

[SPARK-17608][SPARKR]:Long type has incorrect serialization/deserialization

## What changes were proposed in this pull request?
`bigint` is not supported in schemas, and its serialization is not handled as `Double`.

This PR adds `bigint` support in schemas, serializing and deserializing it as `Double`.

This fix is orthogonal to the precision problem in 
https://issues.apache.org/jira/browse/SPARK-12360  
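
A minimal sketch of the usage this change enables, assuming a running SparkR session (data and column names are illustrative):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(
  list(list(1380742793415240, 1, "1"), list(1380742793415240, 2, "2")),
  c("a", "b", "c"))

# With this fix, "bigint" is accepted in a schema and the columns
# round-trip through dapply as doubles.
schema <- structType(structField("a", "bigint"), structField("b", "bigint"),
                     structField("c", "string"))
result <- dapply(df, function(part) { part }, schema)
collect(result)
```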

## How was this patch tested?

Add a new unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark summary

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17640


commit 03b82ac19dcbe17a70d9e45790dd24210b6d4f07
Author: wm...@hotmail.com <wm...@hotmail.com>
Date:   2017-04-14T17:43:35Z

add bigint support







[GitHub] spark issue #17611: [SPARK-20298][SparkR][MINOR] fixed spelling mistake "cha...

2017-04-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17611
  
LGTM





[GitHub] spark issue #17611: [SPARK-20298][SparkR][MINOR] fixed spelling mistake "cha...

2017-04-11 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17611
  
Jenkins, test this please.





[GitHub] spark issue #17478: [SPARK-18901][ML]:Require in LR LogisticAggregator is re...

2017-03-30 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17478
  
@sethah Thanks for your reply! Your suggestion makes sense to me. My intention was to close the JIRA with a simple fix. How about we add a test for these checks and close the original JIRA? Or do you think we should just mark that JIRA as Won't Fix? Thanks!




