[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15414

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r83057213

--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  import PredictorSuite._
+
+  test("should support all NumericType labels and not support other types") {
+    val df = spark.createDataFrame(Seq(
+      (0, Vectors.dense(0, 2, 3)),
+      (1, Vectors.dense(0, 3, 9)),
+      (0, Vectors.dense(0, 2, 6))
+    )).toDF("label", "features")
+
+    val types =
+      Seq(ShortType, LongType, IntegerType, FloatType, ByteType, DoubleType, DecimalType(10, 0))
+
+    val predictor = new MockPredictor()
+
+    types.foreach { t =>
+      predictor.fit(df.select(col("label").cast(t), col("features")))
+    }
+
+    intercept[IllegalArgumentException] {
+      predictor.fit(df.select(col("label").cast(StringType), col("features")))
+    }
+  }
+}
+
+object PredictorSuite {
+
+  class MockPredictor(override val uid: String)
+    extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+    def this() = this(Identifiable.randomUID("mockpredictor"))
+
+    override def train(dataset: Dataset[_]): MockPredictionModel = {
+      require(dataset.schema("label").dataType == DoubleType)
+      new MockPredictionModel(uid)
+    }
+
+    override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)

--- End diff --

change the copy methods to throw NotImplementedError
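The reviewer's suggestion above is a common mocking discipline. The sketch below is a framework-free illustration with hypothetical names (no Spark dependency): members a test mock is not exercising throw `NotImplementedError`, so any unexpected code path fails loudly instead of silently returning a dummy value.

```scala
// Hypothetical minimal model interface, standing in for PredictionModel.
trait SimpleModel {
  def predict(features: Array[Double]): Double
  def copy(): SimpleModel
}

class MockModel extends SimpleModel {
  // Neither member is under test, so both fail fast if anything touches them.
  override def predict(features: Array[Double]): Double =
    throw new NotImplementedError()
  override def copy(): SimpleModel =
    throw new NotImplementedError()
}

object MockDemo {
  def main(args: Array[String]): Unit = {
    val m = new MockModel
    // An accidental call to an unimplemented member surfaces immediately.
    val copyFailedFast =
      try { m.copy(); false }
      catch { case _: NotImplementedError => true }
    println(copyFailedFast) // prints "true"
  }
}
```

The payoff is that a test which only means to exercise `train` cannot quietly pass because a stubbed `predict` or `copy` returned a plausible-looking value.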
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82988194

--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+[Apache License 2.0 header, identical to the one quoted in the diff above]
+
+package org.apache.spark.ml
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param.ParamMap
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.types._
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)
+    extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+    override def train(dataset: Dataset[_]): MockPredictionModel = {
+      require(dataset.schema("label").dataType == DoubleType)
+      new MockPredictionModel(uid)
+    }
+
+    override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
+  }
+
+  class MockPredictionModel(override val uid: String)
+    extends PredictionModel[Vector, MockPredictionModel] {
+
+    override def predict(features: Vector): Double = 1.0
+
+    override def copy(extra: ParamMap): MockPredictionModel = defaultCopy(extra)
+  }
+
+  test("should support all NumericType labels and not support other types") {
+    val predictor = new MockPredictor("mock")
+    MLTestingUtils.checkNumericTypes[MockPredictionModel, MockPredictor](

--- End diff --

OK, I will update this.
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82931901

--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+[License header and imports, identical to the diff quoted above]
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)

--- End diff --

move into companion object.
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82932068

--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -121,10 +122,18 @@ abstract class Predictor[
    * and put it in an RDD with strong types.
    */
   protected def extractLabeledPoints(dataset: Dataset[_]): RDD[LabeledPoint] = {
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
+    dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
       case Row(label: Double, features: Vector) => LabeledPoint(label, features)
     }
   }
+
+  /**
+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+  protected def castDataSet(dataset: Dataset[_]): DataFrame = {

--- End diff --

let's just put this logic directly in `fit`
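The shape being discussed, casting the label inside `fit` so that `train` can assume a `Double` label, can be sketched without Spark. Everything below is a hypothetical stand-in (a plain `Row` case class instead of a DataFrame, a `fit` that normalizes labels before delegating to `train`), not the real `Predictor` API.

```scala
// Stand-in for a DataFrame row: label of arbitrary type plus a feature vector.
final case class Row(label: Any, features: Array[Double])

object FitCastDemo {
  // fit itself performs the "cast labelCol to DoubleType" step, then delegates.
  def fit(rows: Seq[Row]): Seq[Row] = {
    val casted = rows.map { r =>
      val label = r.label match {
        case d: Double => d
        case n: Number => n.doubleValue()   // any numeric label widens to Double
        case other =>
          throw new IllegalArgumentException(s"non-numeric label: $other")
      }
      r.copy(label = label)
    }
    train(casted)
  }

  // train no longer needs its own cast: labels are guaranteed to be Double.
  private def train(rows: Seq[Row]): Seq[Row] = {
    require(rows.forall(_.label.isInstanceOf[Double]))
    rows
  }

  def main(args: Array[String]): Unit = {
    val out = fit(Seq(Row(1, Array(0.0)), Row(2L, Array(1.0))))
    println(out.map(_.label).mkString(",")) // prints "1.0,2.0"
  }
}
```

Centralizing the cast in one place is the point of the PR: each concrete `train` implementation can then drop its duplicated `cast(DoubleType)` call.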
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82935295

--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+[License header and imports, identical to the diff quoted above]
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)
+    extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+    override def train(dataset: Dataset[_]): MockPredictionModel = {
+      require(dataset.schema("label").dataType == DoubleType)
+      new MockPredictionModel(uid)
+    }
+
+    override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
+  }
+
+  class MockPredictionModel(override val uid: String)
+    extends PredictionModel[Vector, MockPredictionModel] {
+
+    override def predict(features: Vector): Double = 1.0

--- End diff --

`override def predict(features: Vector): Double = throw new NotImplementedError()`

We can do this for everything except `train`.
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82932894

--- Diff: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala ---
@@ -0,0 +1,57 @@
+[License header and imports, identical to the diff quoted above]
+
+class PredictorSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+
+  import testImplicits._
+
+  class MockPredictor(override val uid: String)
+    extends Predictor[Vector, MockPredictor, MockPredictionModel] {
+
+    override def train(dataset: Dataset[_]): MockPredictionModel = {
+      require(dataset.schema("label").dataType == DoubleType)
+      new MockPredictionModel(uid)
+    }
+
+    override def copy(extra: ParamMap): MockPredictor = defaultCopy(extra)
+  }
+
+  class MockPredictionModel(override val uid: String)
+    extends PredictionModel[Vector, MockPredictionModel] {
+
+    override def predict(features: Vector): Double = 1.0
+
+    override def copy(extra: ParamMap): MockPredictionModel = defaultCopy(extra)
+  }
+
+  test("should support all NumericType labels and not support other types") {
+    val predictor = new MockPredictor("mock")
+    MLTestingUtils.checkNumericTypes[MockPredictionModel, MockPredictor](

--- End diff --

Why don't we just cycle through the types here and call `fit`? I think it's a bit confusing the way it is now.
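The test shape the reviewer asks for (and which the final version of `PredictorSuite` adopts) can be sketched without Spark. The type names and the `fit` stand-in below are hypothetical; the real test casts a DataFrame's label column to each type and calls `Predictor.fit` directly.

```scala
object CycleTypesDemo {
  // Names of the numeric label types the suite exercises.
  val numericTypes = Seq(
    "ByteType", "ShortType", "IntegerType", "LongType",
    "FloatType", "DoubleType", "DecimalType(10,0)")

  // Stand-in for the schema check Predictor.fit performs on the label column.
  def fit(labelType: String): Unit = {
    if (!numericTypes.contains(labelType))
      throw new IllegalArgumentException(s"label type $labelType is not a NumericType")
  }

  def main(args: Array[String]): Unit = {
    // Every numeric type is accepted...
    numericTypes.foreach(fit)
    // ...and a non-numeric type is rejected, mirroring intercept[IllegalArgumentException].
    val rejected =
      try { fit("StringType"); false }
      catch { case _: IllegalArgumentException => true }
    println(rejected) // prints "true"
  }
}
```

Cycling through the types inline keeps the acceptance and rejection cases visible in the test itself, instead of hiding them behind a shared helper like `checkNumericTypes`.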
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82932799

--- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
@@ -117,7 +117,7 @@ object MLTestingUtils extends SparkFunSuite {
     Seq(ShortType, LongType, IntegerType, FloatType, ByteType, DoubleType, DecimalType(10, 0))
     types.map { t =>
       val castDF = df.select(col(labelColName).cast(t), col(featuresColName))
-      t -> TreeTests.setMetadata(castDF, 2, labelColName, featuresColName)
+      t -> TreeTests.setMetadata(castDF, 0, labelColName, featuresColName)

--- End diff --

What is this for? If the intent is to force `getNumClasses` to infer the number of classes, then you're no longer testing the not-inferred case. Further, the point of this PR is to eliminate the need to do that, since it is not a robust solution, IMO. Also, I'd like to remove the dependence on `TreeTests` here (and `genRegressionDF`) and just explicitly set the attributes in the functions.
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82718401

--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -121,10 +122,18 @@ abstract class Predictor[
    * and put it in an RDD with strong types.
    */
   protected def extractLabeledPoints(dataset: Dataset[_]): RDD[LabeledPoint] = {
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
+    dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
       case Row(label: Double, features: Vector) => LabeledPoint(label, features)
     }
   }
+
+  /**
+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+  protected def castDataSet(dataset: Dataset[_]): DataFrame = {
+    val labelMeta = dataset.schema.fields.filter(_.name == $(labelCol)).head.metadata

--- End diff --

Thanks
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15414#discussion_r82684098

--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -121,10 +122,18 @@ abstract class Predictor[
    * and put it in an RDD with strong types.
    */
   protected def extractLabeledPoints(dataset: Dataset[_]): RDD[LabeledPoint] = {
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
+    dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
       case Row(label: Double, features: Vector) => LabeledPoint(label, features)
     }
   }
+
+  /**
+   * Return the given DataFrame, with [[labelCol]] casted to DoubleType.
+   */
+  protected def castDataSet(dataset: Dataset[_]): DataFrame = {
+    val labelMeta = dataset.schema.fields.filter(_.name == $(labelCol)).head.metadata

--- End diff --

Maybe simplify it: `dataset.schema("value").metadata`
[GitHub] spark pull request #15414: [SPARK-17848][ML] Move LabelCol datatype cast int...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/15414

[SPARK-17848][ML] Move LabelCol datatype cast into Predictor.fit

## What changes were proposed in this pull request?

1. Move the label-column cast into `Predictor`.
2. Then remove the now-unnecessary casts in its subclasses.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark move_cast

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15414.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15414

commit 5cb06fcd7987d1889b42a47f38bff89a47161123
Author: Zheng RuiFeng
Date: 2016-10-10T05:44:47Z

    create pr