[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7381#discussion_r34830515

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
+
+  /** @group getParam */
+  def setLabelCol(col: String): this.type = set(labelCol, col)
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
--- End diff --

Ah, the problem is that featureTransformer is used for both transform and transformSchema (and I think we'll need it to transform the input data to predict).

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on the pull request:

    https://github.com/apache/spark/pull/7381#issuecomment-122069902

Sounds good, I'll look at the R integration next.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/7381
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/7381#issuecomment-121817685

LGTM except some minor comments, which we can fix in the next PR. Merged into master. Thanks!

As the next step, we can create a wrapper for `RFormula + LinearRegression` on the Scala side and then call it in R. Independently, we can add features to `RModelParser`. I'd recommend the former first in order to have some working MLlib + SparkR features in 1.5.
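As a reference point for extending the parser, here is a minimal sketch of the subset this thread describes: one response, a `~`, and terms joined by `+`. It is not the PR's `RFormulaParser`, which builds on `RegexParsers`; this version swaps in plain string splitting so it runs on the standard library alone. `ParsedFormula`, `FormulaSketch`, `parseFormula`, and the identifier pattern are all assumptions made for illustration.

```scala
// Hypothetical stand-in for the PR's parsed-formula type.
final case class ParsedFormula(response: String, terms: Seq[String])

object FormulaSketch {
  // Identifier shape assumed from the test inputs quoted in this thread
  // (for example "._foo", "A_VAR", "c123"); the real grammar may differ.
  private val Ident = "[a-zA-Z._][a-zA-Z0-9._]*".r

  // Trim surrounding whitespace and reject anything that is not a bare identifier.
  private def ident(raw: String): String = {
    val name = raw.trim
    require(Ident.pattern.matcher(name).matches(), s"Bad identifier: '$name'")
    name
  }

  // Split on the single '~', then split the right-hand side on '+'.
  def parseFormula(formula: String): ParsedFormula = {
    val sides = formula.split("~", -1)
    require(sides.length == 2, s"Expected exactly one '~' in: $formula")
    val terms = sides(1).split("\\+", -1).map(ident).toSeq
    ParsedFormula(ident(sides(0)), terms)
  }
}
```

New operators would each need their own splitting or precedence rule, which is where a combinator-based parser like the one in the PR starts to pay off over this kind of ad hoc splitting.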
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7381#discussion_r34753039

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
+      val parsed = RFormulaParser.parse(formula)
+      assert(parsed.response == response)
+      assert(parsed.terms == terms)
+    }
+    check("y ~ x", "y", Seq("x"))
+    check("y ~ ._foo ", "y", Seq("._foo"))
+    check("resp ~ A_VAR + B + c123", "resp", Seq("A_VAR", "B", "c123"))
+  }
+
+  test("transform numeric data") {
+    val formula = new RModelFormula().setFormula("id ~ v1 + v2")
+    val original = sqlContext.createDataFrame(
+      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
+    val result = formula.transform(original)
+    val resultSchema = formula.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+        (2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+      ).toDF("id", "v1", "v2", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
+    assert(resultSchema.toString == expected.schema.toString)
+    assert(
+      result.collect().map(_.toString).sorted.mkString(",") ==
--- End diff --

`===` doesn't require `toSeq` to work. I think it is useful to use `===` everywhere in tests, just to make the code consistent. We can do this in next PR.
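One likely reading of the `toSeq` remark: plain `==` on Scala arrays compares references, so tests that compare collected rows tend to convert with `toSeq` first, whereas ScalaTest's `===` compares arrays element by element under its default equality. The sketch below shows only the standard-library half of that and does not depend on ScalaTest; `ArrayEqualitySketch` and the sample strings are made up for illustration.

```scala
object ArrayEqualitySketch {
  // Two distinct arrays with identical contents, standing in for collected rows.
  val left: Array[String] = Array("[0,1.0]", "[2,2.0]")
  val right: Array[String] = Array("[0,1.0]", "[2,2.0]")

  // Arrays are plain JVM arrays, so == falls back to reference equality.
  val referenceEqual: Boolean = left == right

  // Converting to Seq, or calling sameElements, compares contents instead.
  val seqEqual: Boolean = left.toSeq == right.toSeq
  val sameElems: Boolean = left.sameElements(right)
}
```

In the quoted test both sides are already strings built with `mkString`, so `==` and `===` agree there; the suggestion is mainly about consistency and about the richer failure message `===` produces.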
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7381#discussion_r34753079

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
[...]
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
--- End diff --

Actually, I mean `featureTransformer.transform` -> `transformFeatures`. This is minor.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/7381#issuecomment-121776334

Merged build triggered.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/7381#issuecomment-121776348

Merged build started.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/7381#issuecomment-121776841

[Test build #37425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37425/consoleFull) for PR 7381 at commit [`2db68aa`](https://github.com/apache/spark/commit/2db68aaa26d2a963b528449a80cc6cd294c8ec06).
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7381#discussion_r34742755

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
[...]
+  test("transform numeric data") {
+    val formula = new RModelFormula().setFormula("id ~ v1 + v2")
+    val original = sqlContext.createDataFrame(
+      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
+    val result = formula.transform(original)
+    val resultSchema = formula.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+        (2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+      ).toDF("id", "v1", "v2", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
--- End diff --

I see. Is the metadata important (should we include it in transformSchema)?
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7381#discussion_r34742729

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
[...]
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  private def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
--- End diff --

I added a check for this case, but kept the defaults as feature and label unless you think we should always randomize.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7381#discussion_r34742685

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
[...]
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
--- End diff --

Kept as get(), since toString should not throw.
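The distinction behind this reply, going by Spark's `Params` API: `get(param)` returns an `Option` and is safe when the param is unset, while `$(param)` is shorthand for `getOrDefault` and throws when there is neither a value nor a default. The sketch below imitates that behaviour with a hypothetical `MiniParams` class keyed by plain strings; it is not Spark code, and only the contrast between `get` and `$` is meant to carry over.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for Spark's Params; not the real API.
class MiniParams {
  private val FormulaKey = "formula"
  private val values = mutable.Map.empty[String, String]

  def set(name: String, value: String): this.type = { values(name) = value; this }

  // Safe lookup: None when the param has not been set.
  def get(name: String): Option[String] = values.get(name)

  // Strict lookup: throws NoSuchElementException when the param is unset.
  def $(name: String): String =
    values.getOrElse(name, throw new NoSuchElementException(s"Param $name is not set"))

  // Built on get(), so printing an unconfigured instance cannot throw.
  override def toString: String = s"MiniParams(${get(FormulaKey)})"
}
```

With this shape, logging or debugging a freshly constructed instance stays safe, and the strict accessor is reserved for code paths that genuinely require the formula.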
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7381#discussion_r34742784

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
[...]
+    assert(result.schema.toString == resultSchema.toString)
+    assert(resultSchema.toString == expected.schema.toString)
+    assert(
+      result.collect().map(_.toString).sorted.mkString(",") ==
--- End diff --

== works for me, with the expected diffs?
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on the pull request:

    https://github.com/apache/spark/pull/7381#issuecomment-121777321

@mengxr That makes sense, I'll do that in a followup PR. I also addressed the comments.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/7381#issuecomment-121785607

[Test build #37425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37425/console) for PR 7381 at commit [`2db68aa`](https://github.com/apache/spark/commit/2db68aaa26d2a963b528449a80cc6cd294c8ec06).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class RFormula(override val uid: String)`
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121778102 Merged build started.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121778086 Merged build triggered.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121778232 [Test build #37426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37426/consoleFull) for PR 7381 at commit [`d1959d2`](https://github.com/apache/spark/commit/d1959d2818b11c6b173442deb6582e73557545c2).
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121785796 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121788896 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121788851

[Test build #37426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37426/console) for PR 7381 at commit [`d1959d2`](https://github.com/apache/spark/commit/d1959d2818b11c6b173442deb6582e73557545c2).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class RFormula(override val uid: String)`
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34618001

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
    @@ -0,0 +1,78 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +
    +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
    +  test("params") {
    +    ParamsSuite.checkParams(new RModelFormula())
    +  }
    +
    +  test("parse simple formulas") {
    +    def check(formula: String, response: String, terms: Seq[String]) {
    +      new RModelFormula().setFormula(formula)
    --- End diff --

Should it be in a separate test?
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617993

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -116,7 +116,7 @@ class VectorAssembler(override val uid: String)
         if (schema.fieldNames.contains(outputColName)) {
           throw new IllegalArgumentException(s"Output column $outputColName already exists.")
         }
    -    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false))
    +    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, true))
    --- End diff --

Is this change necessary? We assume that the vector is always available.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617858

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    +  extends Transformer with HasFeaturesCol with HasLabelCol {
    +
    +  def this() = this(Identifiable.randomUID("rModelFormula"))
    +
    +  val formula: Param[String] = new Param(this, "formula", "R model formula")
    +  protected var parsedFormula: Option[RFormula] = None
    +
    +  /**
    +   * Sets the formula to use for this transformer. Must be called before use.
    +   * @param value a R formula in string form (e.g. "y ~ x + z")
    +   */
    +  def setFormula(value: String): this.type = {
    +    parsedFormula = Some(RFormulaParser.parse(value))
    +    set(formula, value)
    +    this
    +  }
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    val withFeatures = featureTransformer.transformSchema(schema)
    +    val nullable = schema(parsedFormula.get.response).dataType match {
    +      case _: NumericType | BooleanType => false
    +      case _ => true
    +    }
    +    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
    +  }
    +
    +  override def transform(dataset: DataFrame): DataFrame = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    transformLabel(featureTransformer.transform(dataset))
    +  }
    +
    +  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
    +
    +  override def toString: String = s"RModelFormula(${get(formula)})"
    +
    +  protected def transformLabel(dataset: DataFrame): DataFrame = {
    +    val responseName = parsedFormula.get.response
    +    dataset.schema(responseName).dataType match {
    +      case _: NumericType | BooleanType =>
    +        dataset.select(
    +          col("*"),
    +          dataset(responseName).cast(DoubleType).as($(labelCol)))
    +      case StringType =>
    +        new StringIndexer(uid)
    --- End diff --

Should use a random uid.
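To illustrate the point about uids: passing the parent's `uid` to a child stage makes two distinct stages share one identifier. A minimal sketch of generating a fresh uid per stage might look like the following. `Uids.randomUID` is a hypothetical plain-Scala helper written for this example (a prefix plus a random suffix); it is an assumption for illustration, not Spark's own implementation, and the prefixes are made up.

```scala
import java.util.UUID

object Uids {
  /** Hypothetical helper: the prefix, an underscore, and the last 12 characters of a random UUID. */
  def randomUID(prefix: String): String =
    prefix + "_" + UUID.randomUUID().toString.takeRight(12)

  def main(args: Array[String]): Unit = {
    // The parent stage and the child stage each get their own identifier
    // instead of the child reusing the parent's uid.
    val parentUid = randomUID("rModelFormula")
    val childUid = randomUID("strIdx")
    println(s"parent=$parentUid child=$childUid")
  }
}
```

With something like this, the nested indexer would be constructed with its own identifier rather than with the enclosing transformer's `uid`.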
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617767

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    +  extends Transformer with HasFeaturesCol with HasLabelCol {
    +
    +  def this() = this(Identifiable.randomUID("rModelFormula"))
    +
    +  val formula: Param[String] = new Param(this, "formula", "R model formula")
    +  protected var parsedFormula: Option[RFormula] = None
    +
    +  /**
    +   * Sets the formula to use for this transformer. Must be called before use.
    +   * @param value a R formula in string form (e.g. "y ~ x + z")
    +   */
    +  def setFormula(value: String): this.type = {
    +    parsedFormula = Some(RFormulaParser.parse(value))
    +    set(formula, value)
    +    this
    +  }
    +
    --- End diff --

Missing setters for `featuresCol` and `labelCol`.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617756

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    +  extends Transformer with HasFeaturesCol with HasLabelCol {
    +
    +  def this() = this(Identifiable.randomUID("rModelFormula"))
    +
    +  val formula: Param[String] = new Param(this, "formula", "R model formula")
    +  protected var parsedFormula: Option[RFormula] = None
    --- End diff --

Why is this `protected` instead of `private`?
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617760

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    +  extends Transformer with HasFeaturesCol with HasLabelCol {
    +
    +  def this() = this(Identifiable.randomUID("rModelFormula"))
    +
    +  val formula: Param[String] = new Param(this, "formula", "R model formula")
    +  protected var parsedFormula: Option[RFormula] = None
    +
    +  /**
    +   * Sets the formula to use for this transformer. Must be called before use.
    +   * @param value a R formula in string form (e.g. "y ~ x + z")
    --- End diff --

* missing `@group setParam`
* `a R` -> `an R`
* missing `getFormula`
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617742

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    +  extends Transformer with HasFeaturesCol with HasLabelCol {
    +
    +  def this() = this(Identifiable.randomUID("rModelFormula"))
    +
    +  val formula: Param[String] = new Param(this, "formula", "R model formula")
    --- End diff --

Missing doc and `@group param` in the ScalaDoc. The group is used to group methods in the generated Scala doc.
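As an illustration of the `@group` tags mentioned above, here is a minimal sketch in plain Scala. `FormulaHolder` is a hypothetical class written for this example and is not part of Spark; the group names `setParam` and `getParam` are the ones that appear elsewhere in this review.

```scala
/** Hypothetical holder for a formula string, used only to show `@group` tags. */
class FormulaHolder {
  // Backing storage; empty until setFormula is called.
  private var value: Option[String] = None

  /**
   * Sets the formula.
   * @group setParam
   * @param formula an R formula in string form (e.g. "y ~ x + z")
   */
  def setFormula(formula: String): this.type = {
    value = Some(formula)
    this
  }

  /**
   * Returns the formula, or throws if it was never set.
   * @group getParam
   */
  def getFormula: String =
    value.getOrElse(throw new NoSuchElementException("formula is not set"))
}
```

Members that carry the same `@group` name are meant to be listed together in the generated documentation, which is why the setter and the getter are tagged separately.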
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617806

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    +  extends Transformer with HasFeaturesCol with HasLabelCol {
    +
    +  def this() = this(Identifiable.randomUID("rModelFormula"))
    +
    +  val formula: Param[String] = new Param(this, "formula", "R model formula")
    +  protected var parsedFormula: Option[RFormula] = None
    +
    +  /**
    +   * Sets the formula to use for this transformer. Must be called before use.
    +   * @param value a R formula in string form (e.g. "y ~ x + z")
    +   */
    +  def setFormula(value: String): this.type = {
    +    parsedFormula = Some(RFormulaParser.parse(value))
    +    set(formula, value)
    +    this
    +  }
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    val withFeatures = featureTransformer.transformSchema(schema)
    +    val nullable = schema(parsedFormula.get.response).dataType match {
    +      case _: NumericType | BooleanType => false
    +      case _ => true
    +    }
    +    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
    +  }
    +
    +  override def transform(dataset: DataFrame): DataFrame = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    transformLabel(featureTransformer.transform(dataset))
    +  }
    +
    +  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
    +
    +  override def toString: String = s"RModelFormula(${get(formula)})"
    +
    +  protected def transformLabel(dataset: DataFrame): DataFrame = {
    --- End diff --

private?
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617741

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    --- End diff --

Remove `private[spark]` so Scala users can also use it.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617888

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
    @@ -0,0 +1,121 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import scala.util.parsing.combinator.RegexParsers
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.param.{Param, ParamMap}
    +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
    +import org.apache.spark.ml.util.Identifiable
    +import org.apache.spark.sql.DataFrame
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +
    +/**
    + * :: Experimental ::
    + * Implements the transforms required for fitting a dataset against a R model formula.
    + */
    +@Experimental
    +private[spark] class RModelFormula(override val uid: String)
    +  extends Transformer with HasFeaturesCol with HasLabelCol {
    +
    +  def this() = this(Identifiable.randomUID("rModelFormula"))
    +
    +  val formula: Param[String] = new Param(this, "formula", "R model formula")
    +  protected var parsedFormula: Option[RFormula] = None
    +
    +  /**
    +   * Sets the formula to use for this transformer. Must be called before use.
    +   * @param value a R formula in string form (e.g. "y ~ x + z")
    +   */
    +  def setFormula(value: String): this.type = {
    +    parsedFormula = Some(RFormulaParser.parse(value))
    +    set(formula, value)
    +    this
    +  }
    +
    +  override def transformSchema(schema: StructType): StructType = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    val withFeatures = featureTransformer.transformSchema(schema)
    +    val nullable = schema(parsedFormula.get.response).dataType match {
    +      case _: NumericType | BooleanType => false
    +      case _ => true
    +    }
    +    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
    +  }
    +
    +  override def transform(dataset: DataFrame): DataFrame = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    transformLabel(featureTransformer.transform(dataset))
    +  }
    +
    +  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
    +
    +  override def toString: String = s"RModelFormula(${get(formula)})"
    +
    +  protected def transformLabel(dataset: DataFrame): DataFrame = {
    +    val responseName = parsedFormula.get.response
    +    dataset.schema(responseName).dataType match {
    +      case _: NumericType | BooleanType =>
    +        dataset.select(
    +          col("*"),
    +          dataset(responseName).cast(DoubleType).as($(labelCol)))
    +      case StringType =>
    +        new StringIndexer(uid)
    +          .setInputCol(responseName)
    +          .setOutputCol($(labelCol))
    +          .fit(dataset)
    +          .transform(dataset)
    +      case other =>
    +        throw new IllegalArgumentException("Unsupported type for response: " + other)
    +    }
    +  }
    +
    +  protected def featureTransformer: Transformer = {
    +    // TODO(ekl) add support for non-numeric features and feature interactions
    +    new VectorAssembler(uid)
    +      .setInputCols(parsedFormula.get.terms.toArray)
    +      .setOutputCol($(featuresCol))
    +  }
    +}
    +
    +/**
    + * :: Experimental ::
    --- End diff --

We don't need `:: Experimental ::` on private classes.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34618021 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) + val parsed = RFormulaParser.parse(formula) + assert(parsed.response == response) + assert(parsed.terms == terms) +} +check(y ~ x, y, Seq(x)) +check(y ~ ._foo , y, Seq(._foo)) +check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123)) + } + + test(transform numeric data) { +val formula = new RModelFormula().setFormula(id ~ v1 + v2) +val original = sqlContext.createDataFrame( + Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2) +val result = formula.transform(original) +val resultSchema = formula.transformSchema(original.schema) +val expected = sqlContext.createDataFrame( + Seq( +(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0), +(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0)) + ).toDF(id, v1, v2, features, label) +assert(result.schema.toString == resultSchema.toString) +assert(resultSchema.toString == expected.schema.toString) +assert( + result.collect.map(_.toString).mkString(,) == --- End diff -- `collect` - `collect()` (because it is an action). `collect` doesn't really guarantee the ordering. So it would be nice to put the expected result along with the input data as extra columns. Then make assertions on each record. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
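The suggestion above (carry the expected result alongside each input record, then assert per record) could be sketched roughly as follows. This is a plain-Scala stand-in, not the suite's actual code: `Record`, `transform`, and `allMatch` are hypothetical names, and `transform` only imitates what the formula `id ~ v1 + v2` would produce for numeric input.

```scala
object PerRecordAssertionSketch {
  // Hypothetical stand-in for one DataFrame row: the inputs plus the expected outputs.
  final case class Record(
      id: Int,
      v1: Double,
      v2: Double,
      expectedFeatures: Seq[Double],
      expectedLabel: Double)

  // Hypothetical stand-in for the transformer with formula "id ~ v1 + v2":
  // the features are (v1, v2) and the label is id cast to Double.
  def transform(r: Record): (Seq[Double], Double) = (Seq(r.v1, r.v2), r.id.toDouble)

  val records: Seq[Record] = Seq(
    Record(0, 1.0, 3.0, Seq(1.0, 3.0), 0.0),
    Record(2, 2.0, 5.0, Seq(2.0, 5.0), 2.0))

  /** Checks every record on its own, so the outcome does not depend on row order. */
  def allMatch(rs: Seq[Record]): Boolean = rs.forall { r =>
    val (features, label) = transform(r)
    features == r.expectedFeatures && label == r.expectedLabel
  }

  def main(args: Array[String]): Unit = {
    assert(allMatch(records))
    // Reversing simulates collect() returning the rows in a different order.
    assert(allMatch(records.reverse))
  }
}
```

In the real suite the same idea would presumably mean adding the expected features and label as extra columns of the input DataFrame and matching on each row returned by `collect()`, rather than comparing one concatenated string.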
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617985 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @param value a R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) 
+transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + protected def transformLabel(dataset: DataFrame): DataFrame = { +val responseName = parsedFormula.get.response +dataset.schema(responseName).dataType match { + case _: NumericType | BooleanType = +dataset.select( + col(*), + dataset(responseName).cast(DoubleType).as($(labelCol))) + case StringType = +new StringIndexer(uid) + .setInputCol(responseName) + .setOutputCol($(labelCol)) + .fit(dataset) + .transform(dataset) + case other = +throw new IllegalArgumentException(Unsupported type for response: + other) +} + } + + protected def featureTransformer: Transformer = { +// TODO(ekl) add support for non-numeric features and feature interactions +new VectorAssembler(uid) + .setInputCols(parsedFormula.get.terms.toArray) + .setOutputCol($(featuresCol)) + } +} + +/** + * :: Experimental :: + * Represents a parsed R formula. + */ +private[ml] case class RFormula(response: String, terms: Seq[String]) + +/** + * :: Experimental :: + * Limited implementation of R formula parsing. Currently supports: '~', '+'. + */ +private[ml] object RFormulaParser extends RegexParsers { + def term: Parser[String] = ([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*.r --- End diff -- Does R accept `$` in terms? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this
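As a rough way to see what the `term` pattern in the diff accepts, the regular expression can be tried on its own with the standard library, outside of `RegexParsers`. This is only a sketch: `TermRegexSketch` and `isTerm` are hypothetical names, and the check says nothing about what R itself allows, only what this pattern matches.

```scala
object TermRegexSketch {
  // Same pattern as the `term` parser in the diff, written as a plain stdlib Regex.
  private val Term = """([a-zA-Z]|\.[a-zA-Z_])[a-zA-Z0-9._]*""".r

  /** True only if the whole string is a single term under the pattern above. */
  def isTerm(s: String): Boolean = Term.pattern.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    // "df$x" contains '$', which is not in the character class, so it is rejected.
    Seq("x", "A_VAR", "c123", "._foo", "df$x", "_x", "1abc").foreach { s =>
      println(s"$s -> ${isTerm(s)}")
    }
  }
}
```

Under this pattern a name must start with a letter, or with a dot followed by a letter or underscore, so a `$` anywhere in the term fails to match.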
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34617695 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. --- End diff -- Also mention the operators we support. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632802 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632796 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121430082 [Test build #37282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37282/consoleFull) for PR 7381 at commit [`dc3c943`](https://github.com/apache/spark/commit/dc3c943a9e3167cd419451b3d83a720db5152b23).
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121429679 Merged build triggered.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121429694 Merged build started.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @param value a R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) +transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + protected def transformLabel(dataset: DataFrame): DataFrame = { +val responseName = parsedFormula.get.response +dataset.schema(responseName).dataType match { + case _: NumericType | BooleanType = +dataset.select( + col(*), + dataset(responseName).cast(DoubleType).as($(labelCol))) + case StringType = +new StringIndexer(uid) --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632827
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -116,7 +116,7 @@ class VectorAssembler(override val uid: String)
     if (schema.fieldNames.contains(outputColName)) {
       throw new IllegalArgumentException(s"Output column $outputColName already exists.")
     }
-    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false))
+    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, true))
--- End diff --
I noticed that the schema produced by transform() has this column as nullable, so transformSchema() should probably report it as nullable too. One alternative is to make transform() mark the vector column as non-null, but I am not sure how to do that.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632829 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) --- End diff -- I put it here since the parser is basically private to RModelFormula but could be convinced otherwise. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632824 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @param value a R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) 
+transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + protected def transformLabel(dataset: DataFrame): DataFrame = { +val responseName = parsedFormula.get.response +dataset.schema(responseName).dataType match { + case _: NumericType | BooleanType = +dataset.select( + col(*), + dataset(responseName).cast(DoubleType).as($(labelCol))) + case StringType = +new StringIndexer(uid) + .setInputCol(responseName) + .setOutputCol($(labelCol)) + .fit(dataset) + .transform(dataset) + case other = +throw new IllegalArgumentException(Unsupported type for response: + other) +} + } + + protected def featureTransformer: Transformer = { +// TODO(ekl) add support for non-numeric features and feature interactions +new VectorAssembler(uid) + .setInputCols(parsedFormula.get.terms.toArray) + .setOutputCol($(featuresCol)) + } +} + +/** + * :: Experimental :: + * Represents a parsed R formula. + */ +private[ml] case class RFormula(response: String, terms: Seq[String]) + +/** + * :: Experimental :: + * Limited implementation of R formula parsing. Currently supports: '~', '+'. + */ +private[ml] object RFormulaParser extends RegexParsers { + def term: Parser[String] = ([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*.r --- End diff -- Looks like R supports arbitrary expressions in terms, so we'd need a full parser to be sure. For $ am I not sure it makes sense, since we assume the terms are from the
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632806 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @param value a R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632795 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. + */ +@Experimental +private[spark] class RModelFormula(override val uid: String) --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632811 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @param value a R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) +transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + protected def transformLabel(dataset: DataFrame): DataFrame = { --- End diff -- Done
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632800 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) --- End diff -- Done
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632803 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @param value a R formula in string form (e.g. y ~ x + z) --- End diff -- Done
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121433780 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637543 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) + val parsed = RFormulaParser.parse(formula) + assert(parsed.response == response) --- End diff -- use `===` instead of `==` (and please update other `==`s) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637538 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against an R model formula. Currently + * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula + * docs here: http://www.inside-r.org/r-doc/stats/formula + */ +@Experimental +class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + /** + * R formula parameter. The formula is provided in string form. + * @group setParam + */ + val formula: Param[String] = new Param(this, formula, R model formula) + + private var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @group setParam + * @param value an R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + /** @group getParam */ + def getFormula: String = $(formula) + + /** @group getParam */ + def setFeaturesCol(col: String): this.type = set(featuresCol, col) + + /** @group getParam */ + def setLabelCol(col: String): this.type = set(labelCol, col) + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) +transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + private def transformLabel(dataset: DataFrame): DataFrame = { +val responseName = parsedFormula.get.response --- End diff -- response, target, or label are all valid names. In MLlib, we use label. 
So it might be useful to rename response to label.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637540 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against an R model formula. Currently + * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula + * docs here: http://www.inside-r.org/r-doc/stats/formula + */ +@Experimental +class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + /** + * R formula parameter. The formula is provided in string form. + * @group setParam + */ + val formula: Param[String] = new Param(this, formula, R model formula) + + private var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @group setParam + * @param value an R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + /** @group getParam */ + def getFormula: String = $(formula) + + /** @group getParam */ + def setFeaturesCol(col: String): this.type = set(featuresCol, col) + + /** @group getParam */ + def setLabelCol(col: String): this.type = set(labelCol, col) + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) 
+transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + private def transformLabel(dataset: DataFrame): DataFrame = { +val responseName = parsedFormula.get.response +dataset.schema(responseName).dataType match { + case _: NumericType | BooleanType = +dataset.select( + col(*), + dataset(responseName).cast(DoubleType).as($(labelCol))) --- End diff -- What if `responseName` is the same as `labelCol`? This may cause unexpected behavior. If the input is `DoubleType`, we should allow `labelCol` to be the same as the target term in the formula. If we need to do a transformation, then the user should set a different `labelCol`. We could set the default `featuresCol` and `labelCol` based on the uid so that there is no name collision. I don't think this is a good solution, but I don't have a better suggestion. Btw, we can use `DataFrame.withColumn` to append a new column.
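The rule proposed in this comment can be sketched without Spark. The following is only an illustration under stated assumptions: `LabelColumnSketch`, `planLabel`, `ColType`, and `LabelPlan` are hypothetical names that do not appear in the PR, and the schema is modeled as a plain `Map` rather than a `StructType`.

```scala
// Hypothetical, Spark-free model of the label-column rule discussed above.
// None of these names come from the PR; the schema is a simple Map.
object LabelColumnSketch {
  sealed trait ColType
  case object DoubleCol extends ColType
  case object OtherNumericCol extends ColType
  case object StringCol extends ColType

  sealed trait LabelPlan
  // The response column is already a double and already has the label name.
  case object ReuseExisting extends LabelPlan
  // A new column with the given name must be appended (cast or index).
  final case class AppendCast(name: String) extends LabelPlan
  // The configuration is ambiguous and should be rejected up front.
  final case class Reject(reason: String) extends LabelPlan

  def planLabel(
      schema: Map[String, ColType],
      responseName: String,
      labelCol: String): LabelPlan = {
    schema.get(responseName) match {
      case None =>
        Reject(s"response column '$responseName' not found")
      case Some(DoubleCol) if responseName == labelCol =>
        ReuseExisting
      case Some(_) if schema.contains(labelCol) =>
        Reject(s"column '$labelCol' already exists; set a different labelCol")
      case Some(_) =>
        AppendCast(labelCol)
    }
  }

  def main(args: Array[String]): Unit = {
    val schema = Map[String, ColType]("y" -> DoubleCol, "s" -> StringCol)
    println(planLabel(schema, "y", "y"))     // reuse: double input, same name
    println(planLabel(schema, "s", "label")) // append a new label column
    println(planLabel(schema, "s", "s"))     // rejected: name collision
  }
}
```

In the real transformer, the `AppendCast` case would presumably correspond to something like `dataset.withColumn($(labelCol), dataset(responseName).cast(DoubleType))`, which is the `DataFrame.withColumn` call the reviewer mentions.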
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637533 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against an R model formula. Currently + * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula + * docs here: http://www.inside-r.org/r-doc/stats/formula + */ +@Experimental +class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + /** + * R formula parameter. The formula is provided in string form. + * @group setParam + */ + val formula: Param[String] = new Param(this, formula, R model formula) + + private var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @group setParam + * @param value an R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + /** @group getParam */ + def getFormula: String = $(formula) + + /** @group getParam */ + def setFeaturesCol(col: String): this.type = set(featuresCol, col) --- End diff -- `col` -> `value` (to be consistent with other setters. Since the method name already contains this info, it is not necessary to repeat it in the arg name, especially for really long names)
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637537 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against an R model formula. Currently + * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula + * docs here: http://www.inside-r.org/r-doc/stats/formula + */ +@Experimental +class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + /** + * R formula parameter. The formula is provided in string form. + * @group setParam + */ + val formula: Param[String] = new Param(this, formula, R model formula) + + private var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @group setParam + * @param value an R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + /** @group getParam */ + def getFormula: String = $(formula) + + /** @group getParam */ + def setFeaturesCol(col: String): this.type = set(featuresCol, col) + + /** @group getParam */ + def setLabelCol(col: String): this.type = set(labelCol, col) + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) +transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) --- End diff -- minor: `${get(formula)}` -> `$getFormula` (slightly easier to read)
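The two interpolation forms in this comment may not print the same thing. As far as I understand the `Params` API, `get(param)` returns an `Option`, while a getter built on `$(param)` returns the bare value. The sketch below is a Spark-free illustration of that difference; `ToStringSketch` and `FormulaHolder` are hypothetical stand-ins, not classes from the PR.

```scala
// Hypothetical stand-in for a transformer with one optional string param.
// It only mimics the Option-returning `get` and the value-returning getter.
object ToStringSketch {
  final class FormulaHolder {
    private var formula: Option[String] = None

    def setFormula(value: String): this.type = { formula = Some(value); this }

    // Mirrors an Option-returning accessor (assumed to match Params.get).
    def get: Option[String] = formula

    // Mirrors a typed getter; fails if the formula was never set.
    def getFormula: String =
      formula.getOrElse(throw new NoSuchElementException("formula is not set"))

    // Interpolating the Option keeps the Some(...) wrapper in the output.
    def describeViaGet: String = s"RModelFormula(${get})"

    // Interpolating the getter prints only the formula text.
    def describeViaGetter: String = s"RModelFormula($getFormula)"
  }

  def main(args: Array[String]): Unit = {
    val holder = new FormulaHolder().setFormula("y ~ x + z")
    println(holder.describeViaGet)    // RModelFormula(Some(y ~ x + z))
    println(holder.describeViaGetter) // RModelFormula(y ~ x + z)
  }
}
```

One trade-off to keep in mind: the getter form throws when the formula is unset, whereas the `Option` form prints `None`, so a `toString` built on the getter is only safe once `setFormula` has been called.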
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637535 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against an R model formula. Currently + * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula + * docs here: http://www.inside-r.org/r-doc/stats/formula + */ +@Experimental +class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + /** + * R formula parameter. The formula is provided in string form. + * @group setParam + */ + val formula: Param[String] = new Param(this, formula, R model formula) + + private var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @group setParam + * @param value an R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + /** @group getParam */ + def getFormula: String = $(formula) + + /** @group getParam */ + def setFeaturesCol(col: String): this.type = set(featuresCol, col) + + /** @group getParam */ + def setLabelCol(col: String): this.type = set(labelCol, col) + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) +transformLabel(featureTransformer.transform(dataset)) --- End diff -- To be consistent, rename `featureTransformer` to `transformFeatures`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637529 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against an R model formula. Currently + * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula + * docs here: http://www.inside-r.org/r-doc/stats/formula --- End diff -- Use `http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html` instead, which is in the raw R manual format. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637545 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) + val parsed = RFormulaParser.parse(formula) + assert(parsed.response == response) + assert(parsed.terms == terms) +} +check(y ~ x, y, Seq(x)) +check(y ~ ._foo , y, Seq(._foo)) +check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123)) + } + + test(transform numeric data) { +val formula = new RModelFormula().setFormula(id ~ v1 + v2) +val original = sqlContext.createDataFrame( + Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2) +val result = formula.transform(original) +val resultSchema = formula.transformSchema(original.schema) +val expected = sqlContext.createDataFrame( + Seq( +(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0), +(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0)) + ).toDF(id, v1, v2, features, label) +assert(result.schema.toString == resultSchema.toString) +assert(resultSchema.toString == expected.schema.toString) +assert( + result.collect().map(_.toString).sorted.mkString(,) == --- End diff -- I don't think we need `toString` and `mkString(,)`. Maybe `sorted` is not necessary either. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34637544 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) + val parsed = RFormulaParser.parse(formula) + assert(parsed.response == response) + assert(parsed.terms == terms) +} +check(y ~ x, y, Seq(x)) +check(y ~ ._foo , y, Seq(._foo)) +check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123)) + } + + test(transform numeric data) { +val formula = new RModelFormula().setFormula(id ~ v1 + v2) +val original = sqlContext.createDataFrame( + Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2) +val result = formula.transform(original) +val resultSchema = formula.transformSchema(original.schema) +val expected = sqlContext.createDataFrame( + Seq( +(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0), +(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0)) + ).toDF(id, v1, v2, features, label) +assert(result.schema.toString == resultSchema.toString) --- End diff -- Maybe it is worth leaving a TODO here for `DataType.equals`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632817 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against a R model formula. 
+ */ +@Experimental +private[spark] class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + val formula: Param[String] = new Param(this, formula, R model formula) + protected var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @param value a R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) 
+transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + protected def transformLabel(dataset: DataFrame): DataFrame = { +val responseName = parsedFormula.get.response +dataset.schema(responseName).dataType match { + case _: NumericType | BooleanType = +dataset.select( + col(*), + dataset(responseName).cast(DoubleType).as($(labelCol))) + case StringType = +new StringIndexer(uid) + .setInputCol(responseName) + .setOutputCol($(labelCol)) + .fit(dataset) + .transform(dataset) + case other = +throw new IllegalArgumentException(Unsupported type for response: + other) +} + } + + protected def featureTransformer: Transformer = { +// TODO(ekl) add support for non-numeric features and feature interactions +new VectorAssembler(uid) + .setInputCols(parsedFormula.get.terms.toArray) + .setOutputCol($(featuresCol)) + } +} + +/** + * :: Experimental :: --- End diff -- Done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34632833 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) + val parsed = RFormulaParser.parse(formula) + assert(parsed.response == response) + assert(parsed.terms == terms) +} +check(y ~ x, y, Seq(x)) +check(y ~ ._foo , y, Seq(._foo)) +check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123)) + } + + test(transform numeric data) { +val formula = new RModelFormula().setFormula(id ~ v1 + v2) +val original = sqlContext.createDataFrame( + Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2) +val result = formula.transform(original) +val resultSchema = formula.transformSchema(original.schema) +val expected = sqlContext.createDataFrame( + Seq( +(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0), +(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0)) + ).toDF(id, v1, v2, features, label) +assert(result.schema.toString == resultSchema.toString) +assert(resultSchema.toString == expected.schema.toString) +assert( + result.collect.map(_.toString).mkString(,) == --- End diff -- Do you know the right way to compare schemas / Rows for equality? It seems equals() is not implemented for either. Also added sorted to fix the ordering issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121433740 [Test build #37282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37282/console) for PR 7381 at commit [`dc3c943`](https://github.com/apache/spark/commit/dc3c943a9e3167cd419451b3d83a720db5152b23). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class RModelFormula(override val uid: String)`
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34643356 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) + val parsed = RFormulaParser.parse(formula) + assert(parsed.response == response) + assert(parsed.terms == terms) +} +check(y ~ x, y, Seq(x)) +check(y ~ ._foo , y, Seq(._foo)) +check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123)) + } + + test(transform numeric data) { +val formula = new RModelFormula().setFormula(id ~ v1 + v2) +val original = sqlContext.createDataFrame( + Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2) +val result = formula.transform(original) +val resultSchema = formula.transformSchema(original.schema) +val expected = sqlContext.createDataFrame( + Seq( +(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0), +(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0)) + ).toDF(id, v1, v2, features, label) +assert(result.schema.toString == resultSchema.toString) --- End diff -- Just figured out why. The column output from `VectorAssembler` also contains ML attributes that stores feature names. It is not included in `toString` ... If you compare the JSON value, you see: ~~~scala metadata:{[ml_attr:{attrs:{numeric:[{idx:0,name:v1},{idx:1,name:v2}]},num_attrs:2}]} ~~~ from the output. So I think the correct TODO message is also check metadata. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
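Editorial sketch of the point made in the comment above: a schema's `toString` can hide field metadata that the JSON form and case-class equality do see. This assumes `spark-sql` is on the classpath; the `attrs` metadata is a hypothetical stand-in for the ML attributes that `VectorAssembler` attaches, not the real attribute layout.

```scala
import org.apache.spark.sql.types.{DoubleType, MetadataBuilder, StructField, StructType}

object SchemaMetadataDemo {
  // Hypothetical metadata, standing in for the ML attributes mentioned above.
  private val attrs = new MetadataBuilder().putString("name", "v1").build()

  // Two schemas that differ only in the metadata attached to the field.
  val plain: StructType =
    StructType(Seq(StructField("features", DoubleType, nullable = false)))
  val annotated: StructType =
    StructType(Seq(StructField("features", DoubleType, nullable = false, attrs)))

  def main(args: Array[String]): Unit = {
    // Compare the three views of the schema side by side.
    println(s"toString equal: ${plain.toString == annotated.toString}")
    println(s"json equal:     ${plain.json == annotated.json}")
    println(s"== equal:       ${plain == annotated}")
  }
}
```

If the observation in the comment holds for the Spark version in use, only the `toString` comparison should report equality, which is why a string-based schema assertion can pass while the metadata differs.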
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34643345 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala --- @@ -116,7 +116,7 @@ class VectorAssembler(override val uid: String) if (schema.fieldNames.contains(outputColName)) { throw new IllegalArgumentException(s"Output column $outputColName already exists.") } -StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false)) +StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, true)) --- End diff -- Okay, I think this is minor.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34643347 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ --- End diff -- remove unused imports --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34643358 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) + val parsed = RFormulaParser.parse(formula) + assert(parsed.response == response) + assert(parsed.terms == terms) +} +check(y ~ x, y, Seq(x)) +check(y ~ ._foo , y, Seq(._foo)) +check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123)) + } + + test(transform numeric data) { +val formula = new RModelFormula().setFormula(id ~ v1 + v2) +val original = sqlContext.createDataFrame( + Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2) +val result = formula.transform(original) +val resultSchema = formula.transformSchema(original.schema) +val expected = sqlContext.createDataFrame( + Seq( +(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0), +(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0)) + ).toDF(id, v1, v2, features, label) +assert(result.schema.toString == resultSchema.toString) +assert(resultSchema.toString == expected.schema.toString) +assert( + result.collect().map(_.toString).sorted.mkString(,) == --- End diff -- `assert(result.collect() === expected.collect())` works for me. Note that `===` works but `==` doesn't. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
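Editorial sketch of why `==` fails on collected arrays while ScalaTest's `===` passes. Plain tuples stand in for the collected `Row`s so the example needs no Spark; the names `result` and `expected` only mirror the test above. As far as I know, ScalaTest's default equality compares arrays element by element, which `sameElements` approximates here.

```scala
object ArrayEqualityDemo {
  // Stand-ins for result.collect() and expected.collect().
  val result: Array[(Int, Double)] = Array((0, 0.0), (2, 2.0))
  val expected: Array[(Int, Double)] = Array((0, 0.0), (2, 2.0))

  // `==` on JVM arrays is reference equality, so two distinct arrays differ.
  val sameReference: Boolean = result == expected

  // Element-wise comparisons succeed when the contents and order match.
  val sameElements: Boolean = result.sameElements(expected)
  val sameAsSeq: Boolean = result.toSeq == expected.toSeq

  def main(args: Array[String]): Unit = {
    println(s"==           : $sameReference")
    println(s"sameElements : $sameElements")
    println(s"toSeq ==     : $sameAsSeq")
  }
}
```

Both element-wise checks are order-sensitive, so whether `sorted` is still needed depends on whether the collected rows come back in a stable order.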
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34643348 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { --- End diff -- Rename `RFormulaModelSuite` -> `RModelFormulaSuite`.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34643578 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala --- @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext { + test(params) { +ParamsSuite.checkParams(new RModelFormula()) + } + + test(parse simple formulas) { +def check(formula: String, response: String, terms: Seq[String]) { + new RModelFormula().setFormula(formula) --- End diff -- Whether to test private classes or not might result in a much longer discussion :) In MLlib, we usually expose few public APIs, while the implementation might consist of several pieces. It is useful to test each piece individually even though they are not public.
For example, in ALS (https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala#L53), it is hard to write useful unit tests without testing the individual components.
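Editorial sketch of what testing one internal piece in isolation can look like, using formula parsing as the piece. This is not the PR's `RFormulaParser` (which is built on `RegexParsers`); `FormulaParserSketch`, `ParsedFormula`, and the identifier pattern are hypothetical and chosen only to accept the names used in the test cases quoted above.

```scala
object FormulaParserSketch {
  final case class ParsedFormula(response: String, terms: Seq[String])

  // Assumed identifier shape: letters, digits, '_' and '.', not starting with a digit.
  private val Identifier = "[a-zA-Z_.][a-zA-Z0-9._]*"
  private val Formula = s"\\s*($Identifier)\\s*~\\s*(.+?)\\s*".r

  def parse(formula: String): ParsedFormula = formula match {
    case Formula(response, rhs) =>
      // Limit -1 keeps trailing empty pieces, so "y ~ x +" is rejected below.
      val terms = rhs.split("\\+", -1).map(_.trim).toSeq
      require(terms.forall(_.matches(Identifier)), s"Bad term in: $formula")
      ParsedFormula(response, terms)
    case _ =>
      throw new IllegalArgumentException(s"Could not parse formula: $formula")
  }

  def main(args: Array[String]): Unit = {
    println(parse("y ~ x"))
    println(parse("resp ~ A_VAR + B + c123"))
  }
}
```

Because the parser is a pure function from a string to a small case class, it can be checked directly without building a DataFrame or constructing the public transformer.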
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121485499 @ericl I made another pass. The major issue is that `RModelFormula` should be an `Estimator` instead of a `Transformer` in order to handle String columns, which requires some changes to the current implementation. So I would suggest removing the support for string labels from this PR and addressing it in a follow-up PR, since we have already reviewed most of the code. It is okay to just comment out the test. Does that sound good to you?
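Editorial sketch of why string labels call for a fit step: the label-to-index mapping has to be learned from the data before any row can be transformed. This is plain Scala with hypothetical names (`fit`, `IndexerModel`), not the Spark `Estimator` or `StringIndexer` API; ordering labels by descending frequency is an assumption modelled on how `StringIndexer` is documented to behave.

```scala
object StringLabelIndexing {
  // The fitted state: a label -> index mapping learned from training data.
  final case class IndexerModel(labelToIndex: Map[String, Double]) {
    // Throws NoSuchElementException for labels that were not seen during fit.
    def transform(labels: Seq[String]): Seq[Double] = labels.map(labelToIndex)
  }

  // The "estimator" step: one pass over the data, most frequent label gets 0.0.
  def fit(labels: Seq[String]): IndexerModel = {
    val byFrequency = labels
      .groupBy(identity)
      .toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
    IndexerModel(byFrequency.zipWithIndex.map { case (l, i) => l -> i.toDouble }.toMap)
  }

  def main(args: Array[String]): Unit = {
    val model = fit(Seq("b", "a", "b", "c", "b", "a"))
    println(model.transform(Seq("a", "b", "c")))
  }
}
```

A transformer that re-learns the mapping inside every `transform` call could assign different indices to training and prediction data, which is the problem the split into fit and transform avoids.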
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7381#discussion_r34643726 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala --- @@ -0,0 +1,136 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.parsing.combinator.RegexParsers + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * :: Experimental :: + * Implements the transforms required for fitting a dataset against an R model formula. Currently + * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula + * docs here: http://www.inside-r.org/r-doc/stats/formula + */ +@Experimental +class RModelFormula(override val uid: String) + extends Transformer with HasFeaturesCol with HasLabelCol { + + def this() = this(Identifiable.randomUID(rModelFormula)) + + /** + * R formula parameter. The formula is provided in string form. + * @group setParam + */ + val formula: Param[String] = new Param(this, formula, R model formula) + + private var parsedFormula: Option[RFormula] = None + + /** + * Sets the formula to use for this transformer. Must be called before use. + * @group setParam + * @param value an R formula in string form (e.g. y ~ x + z) + */ + def setFormula(value: String): this.type = { +parsedFormula = Some(RFormulaParser.parse(value)) +set(formula, value) +this + } + + /** @group getParam */ + def getFormula: String = $(formula) + + /** @group getParam */ + def setFeaturesCol(col: String): this.type = set(featuresCol, col) + + /** @group getParam */ + def setLabelCol(col: String): this.type = set(labelCol, col) + + override def transformSchema(schema: StructType): StructType = { +require(parsedFormula.isDefined, Must call setFormula() first.) +val withFeatures = featureTransformer.transformSchema(schema) +val nullable = schema(parsedFormula.get.response).dataType match { + case _: NumericType | BooleanType = false + case _ = true +} +StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable)) + } + + override def transform(dataset: DataFrame): DataFrame = { +require(parsedFormula.isDefined, Must call setFormula() first.) 
+transformLabel(featureTransformer.transform(dataset)) + } + + override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra) + + override def toString: String = sRModelFormula(${get(formula)}) + + private def transformLabel(dataset: DataFrame): DataFrame = { +val responseName = parsedFormula.get.response +dataset.schema(responseName).dataType match { + case _: NumericType | BooleanType = +dataset.select( + col(*), + dataset(responseName).cast(DoubleType).as($(labelCol))) + case StringType = +new StringIndexer() --- End diff -- It might be necessary to implement `RModelFormula` as an `Estimator`. Otherwise, this StringIndexer() will be invoked every time `transform` is called, and if the input dataset is different, it would produce different answers. For this PR, how about removing support for string labels? In a follow-up PR, we can make `RModelFormula` an `Estimator` whose `fit` returns an `RModelFormulaModel` ... (The name is awkward. Maybe we should call them `RFormula` and `RFormulaModel` instead.)
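The concern about indexing string labels inside `transform` can be illustrated with a small plain-Scala sketch. Everything here (`LabelIndexingSketch`, `fitIndex`, `indexOnTheFly`, `FittedIndexer`) is a hypothetical stand-in and not Spark's `StringIndexer` API; the frequency-based ordering is an assumption chosen only to make the effect visible.

```scala
// Illustrative only: plain-Scala stand-ins, not Spark's StringIndexer API.
object LabelIndexingSketch {
  /** Builds a label -> index map, most frequent label first (ties broken alphabetically). */
  def fitIndex(labels: Seq[String]): Map[String, Double] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, i) => label -> i.toDouble }
      .toMap

  /** Transformer-style: re-derives the mapping from whatever data it is given. */
  def indexOnTheFly(labels: Seq[String]): Seq[Double] = {
    val index = fitIndex(labels)
    labels.map(index)
  }

  /** Estimator-style: the mapping is computed once by `fit` and then reused. */
  final class FittedIndexer(index: Map[String, Double]) {
    // Labels not seen during fit would throw here; handling them is out of scope.
    def transform(labels: Seq[String]): Seq[Double] = labels.map(index)
  }

  def fit(labels: Seq[String]): FittedIndexer = new FittedIndexer(fitIndex(labels))
}
```

With training labels `Seq("yes", "yes", "no")` and later data `Seq("no", "no", "yes")`, `indexOnTheFly` maps "yes" to 0.0 on the first dataset but to 1.0 on the second, whereas `fit(train).transform(later)` keeps "yes" at 0.0 in both. That stability is what the `fit` step of an estimator provides.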
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121099053 Merged build triggered.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121102812 [Test build #37170 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37170/consoleFull) for PR 7381 at commit [`5765ec6`](https://github.com/apache/spark/commit/5765ec6ace737049c91a1096f3e5c4670a2b19f2).
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121099480 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121107200 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121107144 [Test build #37170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37170/console) for PR 7381 at commit [`5765ec6`](https://github.com/apache/spark/commit/5765ec6ace737049c91a1096f3e5c4670a2b19f2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121102610 [Test build #37167 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37167/consoleFull) for PR 7381 at commit [`1f361b0`](https://github.com/apache/spark/commit/1f361b0e0f6a7de12a39bc1b75fd59f6a7128ab8).
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121102116 Merged build started.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
GitHub user ericl opened a pull request: https://github.com/apache/spark/pull/7381 [SPARK-8774] [ML] Add R model formula with basic support as a transformer This implements minimal R formula support as a feature transformer. Both numeric and string labels are supported, but features must be numeric for now. cc @mengxr You can merge this pull request into a Git repository by running: $ git pull https://github.com/ericl/spark spark-8774-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7381.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7381 commit fb0826b875d8cda29dce6ec6654cdf0f66ac958f Author: Eric Liang e...@databricks.com Date: 2015-07-14T00:32:11Z [SPARK-8774] Add R model formula with basic support as a transformer
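As a rough illustration of what applying a formula such as `y ~ x + z` as a feature transformer means, the sketch below works on a single row represented as a `Map[String, Double]`. This is plain Scala with hypothetical names (`FormulaApplySketch`, `LabeledRow`, `applyFormula`); the PR itself operates on Spark DataFrames and writes to the configured features and label columns, so this is only an assumption-laden picture of the idea, restricted to numeric columns.

```scala
// Hypothetical illustration in plain Scala; the PR itself operates on Spark DataFrames.
object FormulaApplySketch {
  final case class LabeledRow(features: Vector[Double], label: Double)

  /** Picks the term columns as features (in formula order) and the response as the label. */
  def applyFormula(
      response: String,
      terms: Seq[String],
      row: Map[String, Double]): LabeledRow =
    // A column missing from the row throws NoSuchElementException.
    LabeledRow(terms.map(row).toVector, row(response))
}
```

For the row `Map("y" -> 1.0, "x" -> 2.0, "z" -> 3.0)` and the formula `y ~ x + z`, the features are `Vector(2.0, 3.0)` and the label is `1.0`; columns not named in the formula are ignored.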
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121099302 [Test build #37166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37166/consoleFull) for PR 7381 at commit [`fb0826b`](https://github.com/apache/spark/commit/fb0826b875d8cda29dce6ec6654cdf0f66ac958f).
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121099478 [Test build #37166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37166/console) for PR 7381 at commit [`fb0826b`](https://github.com/apache/spark/commit/fb0826b875d8cda29dce6ec6654cdf0f66ac958f). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121102765 Merged build triggered.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121102101 Merged build triggered.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121102778 Merged build started.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121099063 Merged build started.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121108982 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7381#issuecomment-121108947 [Test build #37167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37167/console) for PR 7381 at commit [`1f361b0`](https://github.com/apache/spark/commit/1f361b0e0f6a7de12a39bc1b75fd59f6a7128ab8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.