[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-11-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15414


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r83057213
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  import PredictorSuite._
+
+  test("should support all NumericType labels and not support other 
types") {
+val df = spark.createDataFrame(Seq(
+  (0, Vectors.dense(0, 2, 3)),
+  (1, Vectors.dense(0, 3, 9)),
+  (0, Vectors.dense(0, 2, 6))
+)).toDF("label", "features")
+
+val types =
+  Seq(ShortType, LongType, IntegerType, FloatType, ByteType, 
DoubleType, DecimalType(10, 0))
+
+val predictor = new MockPredictor()
+
+types.foreach { t =>
+  predictor.fit(df.select(col("label").cast(t), col("features")))
+}
+
+intercept[IllegalArgumentException] {
+  predictor.fit(df.select(col("label").cast(StringType), 
col("features")))
+}
+  }
+}
+
+object PredictorSuite {
+
+  class MockPredictor(override val uid: String)
+extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+def this() = this(Identifiable.randomUID("mockpredictor"))
+
+override def train(dataset: Dataset[_]): MockPredictionModel = {
+  require(dataset.schema("label").dataType == DoubleType)
+  new MockPredictionModel(uid)
+}
+
+override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
--- End diff --

change the copy methods to throw NotImplementedError


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-12 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82988194
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)
+extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+override def train(dataset: Dataset[_]): MockPredictionModel = {
+  require(dataset.schema("label").dataType == DoubleType)
+  new MockPredictionModel(uid)
+}
+
+override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
+  }
+
+  class MockPredictionModel(override val uid: String)
+extends PredictionModel[Vector, MockPredictionModel] {
+
+override def predict(features: Vector): Double = 1.0
+
+override def copy(extra: ParamMap): MockPredictionModel = 
defaultCopy(extra)
+  }
+
+  test("should support all NumericType labels and not support other 
types") {
+val predictor = new MockPredictor("mock")
+MLTestingUtils.checkNumericTypes[MockPredictionModel, MockPredictor](
--- End diff --

OK, I will update this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-11 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82931901
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)
--- End diff --

move into companion object.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-11 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82932068
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -121,10 +122,18 @@ abstract class Predictor[
* and put it in an RDD with strong types.
*/
   protected def extractLabeledPoints(dataset: Dataset[_]): 
RDD[LabeledPoint] = {
-dataset.select(col($(labelCol)).cast(DoubleType), 
col($(featuresCol))).rdd.map {
+dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
   case Row(label: Double, features: Vector) => LabeledPoint(label, 
features)
 }
   }
+
+  /**
+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+protected def castDataSet(dataset: Dataset[_]): DataFrame = {
--- End diff --

let's just put this logic directly in `fit`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-11 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82935295
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)
+extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+override def train(dataset: Dataset[_]): MockPredictionModel = {
+  require(dataset.schema("label").dataType == DoubleType)
+  new MockPredictionModel(uid)
+}
+
+override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
+  }
+
+  class MockPredictionModel(override val uid: String)
+extends PredictionModel[Vector, MockPredictionModel] {
+
+override def predict(features: Vector): Double = 1.0
--- End diff --

`override def predict(features: Vector): Double = throw new 
NotImplementedError()` We can do this for everything except `train`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-11 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82932894
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)
+extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+override def train(dataset: Dataset[_]): MockPredictionModel = {
+  require(dataset.schema("label").dataType == DoubleType)
+  new MockPredictionModel(uid)
+}
+
+override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
+  }
+
+  class MockPredictionModel(override val uid: String)
+extends PredictionModel[Vector, MockPredictionModel] {
+
+override def predict(features: Vector): Double = 1.0
+
+override def copy(extra: ParamMap): MockPredictionModel = 
defaultCopy(extra)
+  }
+
+  test("should support all NumericType labels and not support other 
types") {
+val predictor = new MockPredictor("mock")
+MLTestingUtils.checkNumericTypes[MockPredictionModel, MockPredictor](
--- End diff --

Why don't we just cycle through the types here and call `fit`. I think it's 
a bit confusing the way it is now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-11 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82932799
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
@@ -117,7 +117,7 @@ object MLTestingUtils extends SparkFunSuite {
   Seq(ShortType, LongType, IntegerType, FloatType, ByteType, 
DoubleType, DecimalType(10, 0))
 types.map { t =>
 val castDF = df.select(col(labelColName).cast(t), 
col(featuresColName))
-t -> TreeTests.setMetadata(castDF, 2, labelColName, 
featuresColName)
+t -> TreeTests.setMetadata(castDF, 0, labelColName, 
featuresColName)
--- End diff --

What is this for? If the intent is to force `getNumClasses` to infer the 
number of classes, then you're no longer testing the not inferred case. 
Further, the point of this PR is to eliminate the need to do that since it is 
not a robust solution, IMO. 

Also, I'd like to remove the dependence on `TreeTests` here (and 
`genRegressionDF`) and just explicitly set the attributes in the functions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-10 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82718401
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -121,10 +122,18 @@ abstract class Predictor[
* and put it in an RDD with strong types.
*/
   protected def extractLabeledPoints(dataset: Dataset[_]): 
RDD[LabeledPoint] = {
-dataset.select(col($(labelCol)).cast(DoubleType), 
col($(featuresCol))).rdd.map {
+dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
   case Row(label: Double, features: Vector) => LabeledPoint(label, 
features)
 }
   }
+
+  /**
+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+protected def castDataSet(dataset: Dataset[_]): DataFrame = {
+  val labelMeta = dataset.schema.fields.filter(_.name == 
$(labelCol)).head.metadata
--- End diff --

Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-10 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/15414#discussion_r82684098
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -121,10 +122,18 @@ abstract class Predictor[
* and put it in an RDD with strong types.
*/
   protected def extractLabeledPoints(dataset: Dataset[_]): 
RDD[LabeledPoint] = {
-dataset.select(col($(labelCol)).cast(DoubleType), 
col($(featuresCol))).rdd.map {
+dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
   case Row(label: Double, features: Vector) => LabeledPoint(label, 
features)
 }
   }
+
+  /**
+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+protected def castDataSet(dataset: Dataset[_]): DataFrame = {
+  val labelMeta = dataset.schema.fields.filter(_.name == 
$(labelCol)).head.metadata
--- End diff --

Maybe simplify it: dataset.schema("value").metadata


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...

2016-10-09 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/15414

[SPARK-17848][ML] Move LabelCol datatype cast into Predictor.fit

## What changes were proposed in this pull request?
1, move cast to `Predictor`
2, and then, remove unnecessary cast

## How was this patch tested?
existing tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark move_cast

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15414.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15414


commit 5cb06fcd7987d1889b42a47f38bff89a47161123
Author: Zheng RuiFeng 
Date:   2016-10-10T05:44:47Z

create pr




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org