[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-16 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34830515
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
+
+  /** @group getParam */
+  def setLabelCol(col: String): this.type = set(labelCol, col)
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
--- End diff --

Ah, the problem is that featureTransformer is used for both transform and 
transformSchema (and I think we'll need it to transform the input data to 
predict).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-16 Thread ericl
Github user ericl commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-122069902
  
Sounds good, I'll look at the R integration next.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7381





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121817685
  
LGTM except for some minor comments, which we can fix in the next PR. Merged
into master. Thanks! As the next step, we can create a wrapper for `RFormula +
LinearRegression` on the Scala side and then call it from R. Independently, we
can add features to `RModelParser`. I'd recommend doing the former first, so
that we have some working MLlib + SparkR features in 1.5.
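
For readers following along, here is a rough, standard-library-only sketch of the subset the parser currently handles (a response, `~`, and `+`-separated terms). The names `ParsedFormula` and `SimpleFormulaParser` are hypothetical and are not the classes in this pull request, which builds on `RegexParsers`. The name pattern is an assumption modelled on the inputs in the test suite.

```scala
// Hypothetical, simplified stand-in for the formula parser discussed in this thread.
final case class ParsedFormula(response: String, terms: Seq[String])

object SimpleFormulaParser {
  // Assumed name syntax: a letter or '.', then letters, digits, '.', or '_'.
  private val Name = """[a-zA-Z.][a-zA-Z0-9._]*"""

  // Whole-string pattern: <response> ~ <right-hand side>, ignoring outer whitespace.
  private val Formula = ("""\s*(""" + Name + """)\s*~\s*(.+?)\s*""").r

  def parse(formula: String): ParsedFormula = formula match {
    case Formula(response, rhs) =>
      // Split on '+' and keep empty pieces so a dangling '+' is rejected below.
      val terms = rhs.split("\\+", -1).map(_.trim).toSeq
      require(terms.forall(_.matches(Name)), s"Bad term in formula: $formula")
      ParsedFormula(response, terms)
    case _ =>
      throw new IllegalArgumentException(s"Could not parse formula: $formula")
  }
}
```

Supporting further R operators would mean extending the right-hand-side handling, which is where a combinator-based parser is easier to grow than this split-based sketch.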





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34753039
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
+      val parsed = RFormulaParser.parse(formula)
+      assert(parsed.response == response)
+      assert(parsed.terms == terms)
+    }
+    check("y ~ x", "y", Seq("x"))
+    check("y ~   ._foo  ", "y", Seq("._foo"))
+    check("resp ~ A_VAR + B + c123", "resp", Seq("A_VAR", "B", "c123"))
+  }
+
+  test("transform numeric data") {
+    val formula = new RModelFormula().setFormula("id ~ v1 + v2")
+    val original = sqlContext.createDataFrame(
+      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
+    val result = formula.transform(original)
+    val resultSchema = formula.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+        (2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+      ).toDF("id", "v1", "v2", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
+    assert(resultSchema.toString == expected.schema.toString)
+    assert(
+      result.collect().map(_.toString).sorted.mkString(",") ==
--- End diff --

`===` doesn't require `toSeq` to work. I think it is useful to use `===`
everywhere in tests, just to make the code consistent. We can do this in the
next PR.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34753079
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
--- End diff --

Actually, I mean `featureTransformer.transform` -> `transformFeatures`.
This is minor.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121776334
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121776348
  
Merged build started.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121776841
  
[Test build #37425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37425/consoleFull) for PR 7381 at commit [`2db68aa`](https://github.com/apache/spark/commit/2db68aaa26d2a963b528449a80cc6cd294c8ec06).





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34742755
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+  test("transform numeric data") {
+    val formula = new RModelFormula().setFormula("id ~ v1 + v2")
+    val original = sqlContext.createDataFrame(
+      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
+    val result = formula.transform(original)
+    val resultSchema = formula.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+        (2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+      ).toDF("id", "v1", "v2", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
--- End diff --

I see. Is the metadata important (should we include it in transformSchema)?





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34742729
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  private def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
--- End diff --

I added a check for this case, but kept the defaults as feature and 
label unless you think we should always randomize.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34742685
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
--- End diff --

Kept as get(), since toString should not throw.
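
To illustrate the distinction: in Spark's `Params` API, `get(param)` returns an `Option`, while `$(param)` throws if the parameter has neither a value nor a default. The sketch below mimics that with a hypothetical `FormulaHolder` class, a simplified stand-in and not Spark code. `getOrThrow` plays the role of `$`.

```scala
// Hypothetical, simplified stand-in for a parameter holder; not Spark's Params API.
final class FormulaHolder {
  private var formula: Option[String] = None

  def setFormula(value: String): this.type = {
    formula = Some(value)
    this
  }

  // Safe accessor: returns None when the formula has not been set.
  def get: Option[String] = formula

  // Strict accessor: throws when the formula has not been set.
  def getOrThrow: String =
    formula.getOrElse(throw new NoSuchElementException("formula is not set"))

  // Uses the safe accessor, so printing an unconfigured instance never throws.
  override def toString: String = s"FormulaHolder(${get})"
}
```

With the strict accessor in `toString`, merely logging or printing a freshly constructed instance would raise an exception, which is the behaviour being avoided here.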





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34742784
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+  test("transform numeric data") {
+    val formula = new RModelFormula().setFormula("id ~ v1 + v2")
+    val original = sqlContext.createDataFrame(
+      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
+    val result = formula.transform(original)
+    val resultSchema = formula.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+        (2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+      ).toDF("id", "v1", "v2", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
+    assert(resultSchema.toString == expected.schema.toString)
+    assert(
+      result.collect().map(_.toString).sorted.mkString(",") ==
--- End diff --

== works for me, with the expected diffs?





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread ericl
Github user ericl commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121777321
  
@mengxr That makes sense, I'll do that in a followup PR. I also addressed 
the comments.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121785607
  
  [Test build #37425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37425/console) for PR 7381 at commit [`2db68aa`](https://github.com/apache/spark/commit/2db68aaa26d2a963b528449a80cc6cd294c8ec06).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class RFormula(override val uid: String)`






[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121778102
  
Merged build started.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121778086
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121778232
  
  [Test build #37426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37426/consoleFull) for PR 7381 at commit [`d1959d2`](https://github.com/apache/spark/commit/d1959d2818b11c6b173442deb6582e73557545c2).





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121785796
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121788896
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121788851
  
  [Test build #37426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37426/console) for PR 7381 at commit [`d1959d2`](https://github.com/apache/spark/commit/d1959d2818b11c6b173442deb6582e73557545c2).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class RFormula(override val uid: String)`






[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34618001
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
--- End diff --

Should it be in a separate test?





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617993
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -116,7 +116,7 @@ class VectorAssembler(override val uid: String)
     if (schema.fieldNames.contains(outputColName)) {
       throw new IllegalArgumentException(s"Output column $outputColName already exists.")
     }
-    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false))
+    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, true))
--- End diff --

Is this change necessary? We assume that the vector is always available.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617858
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
+      case StringType =>
+        new StringIndexer(uid)
--- End diff --

Should use a random uid.
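
A hedged sketch of what this could look like, not the code that ended up in the PR: the nested stage draws its own identifier instead of reusing the parent's `uid`. To keep the snippet runnable without Spark, `randomUID` and `Stage` below are hypothetical stand-ins written for illustration; in the diff above, `Identifiable.randomUID` plays the role of the generator.

```scala
import java.util.UUID

object UidSketch {
  // Hypothetical stand-in for a uid generator: prefix plus a random suffix.
  def randomUID(prefix: String): String =
    prefix + "_" + UUID.randomUUID().toString.takeRight(12)

  // Invented placeholder for a pipeline stage that carries a uid.
  final case class Stage(uid: String)

  def main(args: Array[String]): Unit = {
    val parent = Stage(randomUID("rModelFormula"))
    // The nested stage gets a fresh uid rather than borrowing parent.uid.
    val indexer = Stage(randomUID("strIdx"))
    println(s"parent=${parent.uid} indexer=${indexer.uid}")
  }
}
```

With distinct uids, anything keyed on a stage's identifier can tell the parent transformer and the helper stage apart.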





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617767
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
--- End diff --

Missing setters for `featuresCol` and `labelCol`.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617756
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
--- End diff --

Why is this `protected` instead of `private`?





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617760
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. "y ~ x + z")
--- End diff --

* missing `@group setParam`
* `a R` -> `an R`
* missing `getFormula`
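
As a rough illustration of the first and third points, assuming nothing about the final PR: a setter/getter pair tagged with `@group setParam` and `@group getParam` so the generated Scala doc lists them together. `FormulaHolder` is an invented minimal class used only to keep the snippet self-contained; the real transformer would store the value in a `Param` rather than a private field.

```scala
object FormulaParamSketch {
  // Invented minimal holder; stands in for a class that owns a formula param.
  class FormulaHolder {
    private var formula: Option[String] = None

    /**
     * Sets the formula to use, e.g. "y ~ x + z".
     * @group setParam
     */
    def setFormula(value: String): this.type = {
      formula = Some(value)
      this
    }

    /**
     * Returns the formula, failing if it was never set.
     * @group getParam
     */
    def getFormula: String =
      formula.getOrElse(
        throw new IllegalStateException("Must call setFormula() first."))
  }

  def main(args: Array[String]): Unit = {
    // Returning this.type lets the setter chain directly into other calls.
    val holder = new FormulaHolder().setFormula("y ~ x + z")
    println(holder.getFormula)
  }
}
```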





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617742
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
--- End diff --

Missing doc and `@group param` in the ScalaDoc. The group is used to group 
methods in the generated Scala doc.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617806
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
--- End diff --

private?





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617741
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
--- End diff --

Remove `private[spark]` so Scala users can also use it.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617888
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. y ~ x + z)
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
+      case StringType =>
+        new StringIndexer(uid)
+          .setInputCol(responseName)
+          .setOutputCol($(labelCol))
+          .fit(dataset)
+          .transform(dataset)
+      case other =>
+        throw new IllegalArgumentException("Unsupported type for response: " + other)
+    }
+  }
+
+  protected def featureTransformer: Transformer = {
+    // TODO(ekl) add support for non-numeric features and feature interactions
+    new VectorAssembler(uid)
+      .setInputCols(parsedFormula.get.terms.toArray)
+      .setOutputCol($(featuresCol))
+  }
+}
+
+/**
+ * :: Experimental ::
--- End diff --

We don't need `:: Experimental ::` on private classes.



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34618021
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
+      val parsed = RFormulaParser.parse(formula)
+      assert(parsed.response == response)
+      assert(parsed.terms == terms)
+    }
+    check("y ~ x", "y", Seq("x"))
+    check("y ~   ._foo  ", "y", Seq("._foo"))
+    check("resp ~ A_VAR + B + c123", "resp", Seq("A_VAR", "B", "c123"))
+  }
+
+  test("transform numeric data") {
+    val formula = new RModelFormula().setFormula("id ~ v1 + v2")
+    val original = sqlContext.createDataFrame(
+      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
+    val result = formula.transform(original)
+    val resultSchema = formula.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+        (2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+      ).toDF("id", "v1", "v2", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
+    assert(resultSchema.toString == expected.schema.toString)
+    assert(
+      result.collect.map(_.toString).mkString(",") ==
--- End diff --

`collect` -> `collect()` (because it is an action). `collect()` doesn't 
guarantee the ordering of the returned rows, so it would be nice to put the 
expected result along with the input data as extra columns and then make 
assertions on each record.
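
A minimal sketch of the per-record idea, using plain Scala collections as a stand-in for the DataFrame so it runs without Spark. The `Record` type, the `transform` helper, and the `PerRecordAssertions` object are hypothetical names for illustration only, not part of the PR; the point is that each record carries its own expected output, so the assertions hold in any row order.

```scala
// Plain-Scala stand-in for the DataFrame test: each input record carries its
// own expected output, so the assertions do not depend on row order.
object PerRecordAssertions {
  // Hypothetical record type: the input columns plus the expected results.
  final case class Record(
      id: Int,
      v1: Double,
      v2: Double,
      expectedFeatures: Seq[Double],
      expectedLabel: Double)

  // Hypothetical stand-in for transforming one row with the formula "id ~ v1 + v2":
  // the features are (v1, v2) and the label is the response cast to Double.
  def transform(r: Record): (Seq[Double], Double) =
    (Seq(r.v1, r.v2), r.id.toDouble)

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Record(0, 1.0, 3.0, Seq(1.0, 3.0), 0.0),
      Record(2, 2.0, 5.0, Seq(2.0, 5.0), 2.0))

    // Shuffle to mimic an action that may return rows in any order.
    scala.util.Random.shuffle(records).foreach { r =>
      val (features, label) = transform(r)
      assert(features == r.expectedFeatures, s"features mismatch for id=${r.id}")
      assert(label == r.expectedLabel, s"label mismatch for id=${r.id}")
    }
  }
}
```

In the real suite the same shape would presumably use extra DataFrame columns for the expected values and a `collect().foreach` over the transformed rows.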





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617985
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. y ~ x + z)
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
+      case StringType =>
+        new StringIndexer(uid)
+          .setInputCol(responseName)
+          .setOutputCol($(labelCol))
+          .fit(dataset)
+          .transform(dataset)
+      case other =>
+        throw new IllegalArgumentException("Unsupported type for response: " + other)
+    }
+  }
+
+  protected def featureTransformer: Transformer = {
+    // TODO(ekl) add support for non-numeric features and feature interactions
+    new VectorAssembler(uid)
+      .setInputCols(parsedFormula.get.terms.toArray)
+      .setOutputCol($(featuresCol))
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Represents a parsed R formula.
+ */
+private[ml] case class RFormula(response: String, terms: Seq[String])
+
+/**
+ * :: Experimental ::
+ * Limited implementation of R formula parsing. Currently supports: '~', '+'.
+ */
+private[ml] object RFormulaParser extends RegexParsers {
+  def term: Parser[String] = "([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*".r
--- End diff --

Does R accept `$` in terms?
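
For reference, a quick way to see which strings the `term` pattern in the diff accepts is to run it through `scala.util.matching.Regex`. This is only a sketch: the pattern is copied from the diff with its string quotes restored (an assumption, since the archive dropped them), and `TermPatternCheck` and `isTerm` are hypothetical names. It says nothing about what R itself accepts, only what this regex matches.

```scala
object TermPatternCheck {
  // Pattern from the diff, written as a raw string so "\." needs no double escaping.
  val term = """([a-zA-Z]|\.[a-zA-Z_])[a-zA-Z0-9._]*""".r

  // True only if the whole string is matched as a single term.
  def isTerm(s: String): Boolean = term.pattern.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    // Names used in the test suite are accepted.
    assert(isTerm("x"))
    assert(isTerm("._foo"))
    assert(isTerm("A_VAR"))
    assert(isTerm("c123"))

    // A term containing '$' is rejected by this pattern.
    assert(!isTerm("df$x"))
    // So are names that start with a digit, an underscore, or a dot plus digit.
    assert(!isTerm("1abc"))
    assert(!isTerm("_x"))
    assert(!isTerm(".1x"))
  }
}
```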



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34617695
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
--- End diff --

Also mention the operators we support.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632802
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632796
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121430082
  
  [Test build #37282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37282/consoleFull) for PR 7381 at commit [`dc3c943`](https://github.com/apache/spark/commit/dc3c943a9e3167cd419451b3d83a720db5152b23).





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121429679
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121429694
  
Merged build started.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632816
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. y ~ x + z)
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
+      case StringType =>
+        new StringIndexer(uid)
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632827
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -116,7 +116,7 @@ class VectorAssembler(override val uid: String)
     if (schema.fieldNames.contains(outputColName)) {
       throw new IllegalArgumentException(s"Output column $outputColName already exists.")
     }
-    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false))
+    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, true))
--- End diff --

I noticed that the schema of transform() has it as nullable, so probably 
transformSchema() should also. One alternative is to make transform() mark the 
vector as non-null, but I am not exactly sure how to do that.
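
To illustrate why the flag matters for the schema comparison in the test, here is a small sketch, assuming Spark SQL is on the classpath. It uses `DoubleType` as a stand-in for `VectorUDT`, and the `NullableSchemaCheck` object and the column name are hypothetical. It only shows that two fields differing in `nullable` are not equal; it does not show how to make `transform()` produce a non-null column.

```scala
import org.apache.spark.sql.types.{DoubleType, StructField}

object NullableSchemaCheck {
  def main(args: Array[String]): Unit = {
    // Field as a transformSchema() might declare it: not nullable.
    val declared = StructField("features", DoubleType, nullable = false)
    // Field as a transform() result might report it: nullable.
    val actual = StructField("features", DoubleType, nullable = true)

    // StructField is a case class, so the nullable flag takes part in equality
    // and the two fields differ even though name and type agree.
    assert(declared != actual)
    assert(declared.name == actual.name)
    assert(declared.dataType == actual.dataType)

    // Declaring the field as nullable on both sides makes them agree.
    assert(declared.copy(nullable = true) == actual)
  }
}
```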





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632829
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
--- End diff --

I put it here since the parser is basically private to RModelFormula but 
could be convinced otherwise.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632824
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. y ~ x + z)
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
+      case StringType =>
+        new StringIndexer(uid)
+          .setInputCol(responseName)
+          .setOutputCol($(labelCol))
+          .fit(dataset)
+          .transform(dataset)
+      case other =>
+        throw new IllegalArgumentException("Unsupported type for response: " + other)
+    }
+  }
+
+  protected def featureTransformer: Transformer = {
+    // TODO(ekl) add support for non-numeric features and feature interactions
+    new VectorAssembler(uid)
+      .setInputCols(parsedFormula.get.terms.toArray)
+      .setOutputCol($(featuresCol))
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Represents a parsed R formula.
+ */
+private[ml] case class RFormula(response: String, terms: Seq[String])
+
+/**
+ * :: Experimental ::
+ * Limited implementation of R formula parsing. Currently supports: '~', '+'.
+ */
+private[ml] object RFormulaParser extends RegexParsers {
+  def term: Parser[String] = "([a-zA-Z]|\\.[a-zA-Z_])[a-zA-Z0-9._]*".r
--- End diff --

Looks like R supports arbitrary expressions in terms, so we'd need a full 
parser to be sure. For `$`, I am not sure it makes sense, since we assume the 
terms are from the 

[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632806
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
--- End diff --

Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632795
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632811
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632800
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632803
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. "y ~ x + z")
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121433780
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637543
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
+      val parsed = RFormulaParser.parse(formula)
+      assert(parsed.response == response)
--- End diff --

use `===` instead of `==` (and please update other `==`s)





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637538
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
+
+  /** @group getParam */
+  def setLabelCol(col: String): this.type = set(labelCol, col)
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  private def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
--- End diff --

"response", "target", or "label" are all valid names. In MLlib, we use 
"label". So it might be useful to rename "response" to "label".





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637540
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
+
+  /** @group getParam */
+  def setLabelCol(col: String): this.type = set(labelCol, col)
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  private def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
--- End diff --

What if the `responseName` is the same as `labelCol`? This may cause 
unexpected behavior. If the input is `DoubleType`, we should allow `labelCol` 
to be the same as the target term in the formula. If we need to do a 
transformation, then the user should set a different `labelCol`. We can set 
the default `featuresCol` and `labelCol` based on the uid so that they won't 
have a name collision. I don't think this is a good solution, but I don't 
have good suggestions.

Btw, we can use `DataFrame.withColumn` to append a new column.
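
A minimal Scala sketch of the rule proposed above, written without Spark so it stands alone. Everything in it (`LabelColumnRule`, `ResponseType`, `Decision`, `decide`) is a hypothetical stand-in for illustration, not code from the pull request: reuse the response column when it already is a double and shares the name of `labelCol`, append a new column when the names differ, and report a collision otherwise.

```scala
// Sketch only: all names below are hypothetical and not part of the PR.
object LabelColumnRule {
  // Simplified stand-ins for the Spark SQL types matched in the diff.
  sealed trait ResponseType
  case object DoubleResponse extends ResponseType
  case object OtherNumericResponse extends ResponseType
  case object StringResponse extends ResponseType

  // Possible outcomes of the rule.
  sealed trait Decision
  case object ReuseResponse extends Decision // response is already a usable label
  case object AppendLabel extends Decision   // add labelCol next to the response
  final case class NameCollision(message: String) extends Decision

  def decide(response: String, tpe: ResponseType, labelCol: String): Decision =
    if (response != labelCol) AppendLabel
    else tpe match {
      case DoubleResponse => ReuseResponse
      case _ =>
        NameCollision(
          s"Response '$response' needs a transformation; set a different labelCol.")
    }
}
```

In the transformer itself, the `AppendLabel` case would correspond to appending the cast or indexed column, for example with `DataFrame.withColumn` as suggested above.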



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637533
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
--- End diff --

`col` -> `value` (to be consistent with other setters. Since the method 
name already contains this info, it is not necessary to repeat that for the 
arg, especially for really long names)





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637537
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. "y ~ x + z")
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
+
+  /** @group getParam */
+  def setLabelCol(col: String): this.type = set(labelCol, col)
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
--- End diff --

minor: `${get(formula)}` -> `$getFormula` (slightly easier to read)
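
The two interpolation forms are not necessarily interchangeable. The sketch below uses only the Scala standard library; `FormulaHolder` is a hypothetical stand-in for the transformer, written under the assumption that `get(formula)` yields an `Option[String]` while `getFormula` unwraps the value and fails when no formula has been set.

```scala
object InterpolationSketch {
  // Hypothetical stand-in for the transformer; this is NOT Spark's Params API.
  class FormulaHolder(formula: Option[String]) {
    // Assumed to mirror `get(formula)`: returns an Option.
    def get: Option[String] = formula

    // Assumed to mirror `getFormula`: throws NoSuchElementException when unset.
    def getFormula: String = formula.get

    // Interpolating the Option prints the wrapper ("Some(...)" or "None").
    def viaGet: String = s"RModelFormula(${get})"

    // Interpolating the getter prints the bare value, but can throw.
    def viaGetter: String = s"RModelFormula($getFormula)"
  }
}
```

Under these assumptions the shorter form reads better when a formula is set, at the cost of `toString` failing on an unconfigured instance.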


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637535
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R 
model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID(rModelFormula))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, formula, R model 
formula)
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before 
use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. y ~ x + z)
+   */
+  def setFormula(value: String): this.type = {
+parsedFormula = Some(RFormulaParser.parse(value))
+set(formula, value)
+this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
+
+  /** @group getParam */
+  def setLabelCol(col: String): this.type = set(labelCol, col)
+
+  override def transformSchema(schema: StructType): StructType = {
+require(parsedFormula.isDefined, Must call setFormula() first.)
+val withFeatures = featureTransformer.transformSchema(schema)
+val nullable = schema(parsedFormula.get.response).dataType match {
+  case _: NumericType | BooleanType = false
+  case _ = true
+}
+StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, 
nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+require(parsedFormula.isDefined, Must call setFormula() first.)
+transformLabel(featureTransformer.transform(dataset))
--- End diff --

To be consistent, rename `featureTransformer` to `transformFeatures`?





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637529
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R 
model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. 
Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
--- End diff --

Use 
`http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html` 
instead, which is in the raw R manual format.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637545
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
+      val parsed = RFormulaParser.parse(formula)
+      assert(parsed.response == response)
+      assert(parsed.terms == terms)
+    }
+    check("y ~ x", "y", Seq("x"))
+    check("y ~   ._foo  ", "y", Seq("._foo"))
+    check("resp ~ A_VAR + B + c123", "resp", Seq("A_VAR", "B", "c123"))
+  }
+
+  test("transform numeric data") {
+    val formula = new RModelFormula().setFormula("id ~ v1 + v2")
+    val original = sqlContext.createDataFrame(
+      Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")
+    val result = formula.transform(original)
+    val resultSchema = formula.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+        (2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+      ).toDF("id", "v1", "v2", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
+    assert(resultSchema.toString == expected.schema.toString)
+    assert(
+      result.collect().map(_.toString).sorted.mkString(",") ==
--- End diff --

I don't think we need `toString` and `mkString(",")`. Maybe `sorted` is not
necessary either.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34637544
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test(params) {
+ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test(parse simple formulas) {
+def check(formula: String, response: String, terms: Seq[String]) {
+  new RModelFormula().setFormula(formula)
+  val parsed = RFormulaParser.parse(formula)
+  assert(parsed.response == response)
+  assert(parsed.terms == terms)
+}
+check(y ~ x, y, Seq(x))
+check(y ~   ._foo  , y, Seq(._foo))
+check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123))
+  }
+
+  test(transform numeric data) {
+val formula = new RModelFormula().setFormula(id ~ v1 + v2)
+val original = sqlContext.createDataFrame(
+  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2)
+val result = formula.transform(original)
+val resultSchema = formula.transformSchema(original.schema)
+val expected = sqlContext.createDataFrame(
+  Seq(
+(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+  ).toDF(id, v1, v2, features, label)
+assert(result.schema.toString == resultSchema.toString)
--- End diff --

Maybe it is worth leaving a TODO here for `DataType.equals`.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632817
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against a R model formula.
+ */
+@Experimental
+private[spark] class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+  protected var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @param value a R formula in string form (e.g. y ~ x + z)
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  protected def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
+      case StringType =>
+        new StringIndexer(uid)
+          .setInputCol(responseName)
+          .setOutputCol($(labelCol))
+          .fit(dataset)
+          .transform(dataset)
+      case other =>
+        throw new IllegalArgumentException("Unsupported type for response: " + other)
+    }
+  }
+
+  protected def featureTransformer: Transformer = {
+    // TODO(ekl) add support for non-numeric features and feature interactions
+    new VectorAssembler(uid)
+      .setInputCols(parsedFormula.get.terms.toArray)
+      .setOutputCol($(featuresCol))
+  }
+}
+
+/**
+ * :: Experimental ::
--- End diff --

Done





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34632833
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test(params) {
+ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test(parse simple formulas) {
+def check(formula: String, response: String, terms: Seq[String]) {
+  new RModelFormula().setFormula(formula)
+  val parsed = RFormulaParser.parse(formula)
+  assert(parsed.response == response)
+  assert(parsed.terms == terms)
+}
+check(y ~ x, y, Seq(x))
+check(y ~   ._foo  , y, Seq(._foo))
+check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123))
+  }
+
+  test(transform numeric data) {
+val formula = new RModelFormula().setFormula(id ~ v1 + v2)
+val original = sqlContext.createDataFrame(
+  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2)
+val result = formula.transform(original)
+val resultSchema = formula.transformSchema(original.schema)
+val expected = sqlContext.createDataFrame(
+  Seq(
+(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+  ).toDF(id, v1, v2, features, label)
+assert(result.schema.toString == resultSchema.toString)
+assert(resultSchema.toString == expected.schema.toString)
+assert(
+      result.collect.map(_.toString).mkString(",") ==
--- End diff --

Do you know the right way to compare schemas / Rows for equality? It seems 
equals() is not implemented for either.

Also added `sorted` to fix the ordering issue.
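
The ordering fix can be illustrated without Spark. This is a minimal sketch using only the Scala standard library; `FakeRow` and `sameRowsIgnoringOrder` are hypothetical names introduced for illustration, with `FakeRow` standing in for a collected `Row`.

```scala
object OrderInsensitiveCompareSketch {
  // Hypothetical stand-in for a collected Row.
  final case class FakeRow(id: Int, v1: Double)

  // Sort both sides on a stable string key, then compare the sequences
  // structurally, so that partition or collection order does not matter.
  def sameRowsIgnoringOrder(a: Seq[FakeRow], b: Seq[FakeRow]): Boolean =
    a.sortBy(_.toString) == b.sortBy(_.toString)
}
```

Sorting on `toString` is only one possible key; any total ordering applied to both sides would serve the same purpose.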





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121433740
  
  [Test build #37282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37282/console) for PR 7381 at commit [`dc3c943`](https://github.com/apache/spark/commit/dc3c943a9e3167cd419451b3d83a720db5152b23).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class RModelFormula(override val uid: String)`






[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34643356
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test(params) {
+ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test(parse simple formulas) {
+def check(formula: String, response: String, terms: Seq[String]) {
+  new RModelFormula().setFormula(formula)
+  val parsed = RFormulaParser.parse(formula)
+  assert(parsed.response == response)
+  assert(parsed.terms == terms)
+}
+check(y ~ x, y, Seq(x))
+check(y ~   ._foo  , y, Seq(._foo))
+check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123))
+  }
+
+  test(transform numeric data) {
+val formula = new RModelFormula().setFormula(id ~ v1 + v2)
+val original = sqlContext.createDataFrame(
+  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2)
+val result = formula.transform(original)
+val resultSchema = formula.transformSchema(original.schema)
+val expected = sqlContext.createDataFrame(
+  Seq(
+(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+  ).toDF(id, v1, v2, features, label)
+assert(result.schema.toString == resultSchema.toString)
--- End diff --

Just figured out why. The column output from `VectorAssembler` also
contains ML attributes that store the feature names. They are not included in
`toString`. If you compare the JSON value, you see:

~~~scala

metadata:{[ml_attr:{attrs:{numeric:[{idx:0,name:v1},{idx:1,name:v2}]},num_attrs:2}]}
~~~

from the output. So I think the correct TODO message is "also check
metadata".
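
The effect described here, two schemas that print identically yet differ in metadata, can be reproduced in plain Scala. This is a sketch under stated assumptions: `Field` is a hypothetical case class, not Spark's `StructField`, and its `toString` is deliberately written to omit the metadata map, as the comment above reports for the real type.

```scala
object MetadataSketch {
  // Hypothetical stand-in for a schema field that carries metadata.
  final case class Field(
      name: String,
      dataType: String,
      nullable: Boolean,
      metadata: Map[String, String] = Map.empty) {
    // Mimics a toString that leaves the metadata out.
    override def toString: String = s"Field($name,$dataType,$nullable)"
  }

  // One field carries ML attribute metadata, the other does not.
  val assembled: Field = Field("features", "vector", nullable = true, Map("ml_attr" -> "v1,v2"))
  val handBuilt: Field = Field("features", "vector", nullable = true)

  // String comparison cannot see the difference ...
  val sameString: Boolean = assembled.toString == handBuilt.toString
  // ... while value equality (which includes metadata) can.
  val sameValue: Boolean = assembled == handBuilt
}
```

This is why a test that compares `schema.toString` passes even though the metadata differs, and why a TODO to also check metadata is appropriate.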





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34643345
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -116,7 +116,7 @@ class VectorAssembler(override val uid: String)
     if (schema.fieldNames.contains(outputColName)) {
       throw new IllegalArgumentException(s"Output column $outputColName already exists.")
     }
-    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, false))
+    StructType(schema.fields :+ new StructField(outputColName, new VectorUDT, true))
--- End diff --

Okay, I think this is minor.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34643347
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
--- End diff --

remove unused imports





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34643358
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test(params) {
+ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test(parse simple formulas) {
+def check(formula: String, response: String, terms: Seq[String]) {
+  new RModelFormula().setFormula(formula)
+  val parsed = RFormulaParser.parse(formula)
+  assert(parsed.response == response)
+  assert(parsed.terms == terms)
+}
+check(y ~ x, y, Seq(x))
+check(y ~   ._foo  , y, Seq(._foo))
+check(resp ~ A_VAR + B + c123, resp, Seq(A_VAR, B, c123))
+  }
+
+  test(transform numeric data) {
+val formula = new RModelFormula().setFormula(id ~ v1 + v2)
+val original = sqlContext.createDataFrame(
+  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF(id, v1, v2)
+val result = formula.transform(original)
+val resultSchema = formula.transformSchema(original.schema)
+val expected = sqlContext.createDataFrame(
+  Seq(
+(0, 1.0, 3.0, Vectors.dense(Array(1.0, 3.0)), 0.0),
+(2, 2.0, 5.0, Vectors.dense(Array(2.0, 5.0)), 2.0))
+  ).toDF(id, v1, v2, features, label)
+assert(result.schema.toString == resultSchema.toString)
+assert(resultSchema.toString == expected.schema.toString)
+assert(
+      result.collect().map(_.toString).sorted.mkString(",") ==
--- End diff --

`assert(result.collect() === expected.collect())` works for me. Note that 
`===` works but `==` doesn't.
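
The difference comes from how arrays are compared: `collect()` returns an array, plain `==` on two arrays is reference equality, and ScalaTest's `===` compares arrays element by element. A minimal sketch in plain Scala, without Spark or ScalaTest, where the tuples are only an assumed stand-in for the collected rows and `ArrayEqualityDemo` is a made-up name:

```scala
// Stand-in for the arrays returned by collect(); real code would hold Array[Row].
object ArrayEqualityDemo {
  val result: Array[(Int, Double)] = Array((0, 1.0), (2, 2.0))
  val expected: Array[(Int, Double)] = Array((0, 1.0), (2, 2.0))

  // `==` on two distinct array instances is reference equality, so this is false.
  val referenceEqual: Boolean = result == expected

  // Element-wise alternatives that express what the test intends.
  val sameElements: Boolean = result.sameElements(expected)
  val asSeqEqual: Boolean = result.toSeq == expected.toSeq
}
```

Outside ScalaTest, `sameElements` or a conversion to `Seq` gives the same element-wise comparison.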


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34643348
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
--- End diff --

`RFormulaModelSuite` -> `RModelFormulaSuite`





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34643578
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RModelFormulaSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+
+class RFormulaModelSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("params") {
+    ParamsSuite.checkParams(new RModelFormula())
+  }
+
+  test("parse simple formulas") {
+    def check(formula: String, response: String, terms: Seq[String]) {
+      new RModelFormula().setFormula(formula)
--- End diff --

Whether to test private classes or not might lead to a much longer discussion :) 
In MLlib, we usually expose few public APIs, while the implementation might 
consist of several pieces. It is useful to test each piece individually even 
though they are not public. For example, in ALS, 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala#L53,
it is hard to write useful unit tests without testing the individual components.
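
As a rough illustration of testing one internal piece directly, here is a sketch of a tiny formula parser with its own checks. `SimpleFormulaParser` and `ParsedFormula` are hypothetical names, and the regex-based approach is an assumption made for brevity; it is not the `RFormulaParser` from this PR.

```scala
// Hypothetical stand-in for an internal parser; not the PR's RFormulaParser.
final case class ParsedFormula(response: String, terms: Seq[String])

object SimpleFormulaParser {
  // A column name: letters, digits, '.' and '_', not starting with a digit.
  private val Identifier = "[a-zA-Z._][a-zA-Z0-9._]*"
  // response ~ term + term + ..., with arbitrary surrounding whitespace.
  private val Formula = s"\\s*($Identifier)\\s*~\\s*(.+?)\\s*".r

  def parse(formula: String): ParsedFormula = formula match {
    case Formula(response, rhs) =>
      val terms = rhs.split("\\+").map(_.trim).toList
      require(terms.forall(_.matches(Identifier)), s"Bad term in: $formula")
      ParsedFormula(response, terms)
    case _ =>
      throw new IllegalArgumentException(s"Could not parse formula: $formula")
  }
}
```

Checks like `SimpleFormulaParser.parse("y ~ x") == ParsedFormula("y", Seq("x"))` exercise the parser alone, without building a DataFrame or a full transformer.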





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121485499
  
@ericl I made another pass. The major issue is that `RModelFormula` should 
be an `Estimator` instead of a `Transformer` in order to handle String columns. 
That requires some changes to the current implementation, so I would suggest 
removing the support for string labels in this PR and addressing it in a 
follow-up PR, since we have already reviewed most of the code. It is okay to 
just comment out the test. Does that sound good to you?





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7381#discussion_r34643726
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/RModelFormula.scala ---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.parsing.combinator.RegexParsers
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * :: Experimental ::
+ * Implements the transforms required for fitting a dataset against an R model formula. Currently
+ * we support a limited subset of the R operators, including '~' and '+'. Also see the R formula
+ * docs here: http://www.inside-r.org/r-doc/stats/formula
+ */
+@Experimental
+class RModelFormula(override val uid: String)
+  extends Transformer with HasFeaturesCol with HasLabelCol {
+
+  def this() = this(Identifiable.randomUID("rModelFormula"))
+
+  /**
+   * R formula parameter. The formula is provided in string form.
+   * @group setParam
+   */
+  val formula: Param[String] = new Param(this, "formula", "R model formula")
+
+  private var parsedFormula: Option[RFormula] = None
+
+  /**
+   * Sets the formula to use for this transformer. Must be called before use.
+   * @group setParam
+   * @param value an R formula in string form (e.g. y ~ x + z)
+   */
+  def setFormula(value: String): this.type = {
+    parsedFormula = Some(RFormulaParser.parse(value))
+    set(formula, value)
+    this
+  }
+
+  /** @group getParam */
+  def getFormula: String = $(formula)
+
+  /** @group getParam */
+  def setFeaturesCol(col: String): this.type = set(featuresCol, col)
+
+  /** @group getParam */
+  def setLabelCol(col: String): this.type = set(labelCol, col)
+
+  override def transformSchema(schema: StructType): StructType = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val withFeatures = featureTransformer.transformSchema(schema)
+    val nullable = schema(parsedFormula.get.response).dataType match {
+      case _: NumericType | BooleanType => false
+      case _ => true
+    }
+    StructType(withFeatures.fields :+ StructField($(labelCol), DoubleType, nullable))
+  }
+
+  override def transform(dataset: DataFrame): DataFrame = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    transformLabel(featureTransformer.transform(dataset))
+  }
+
+  override def copy(extra: ParamMap): RModelFormula = defaultCopy(extra)
+
+  override def toString: String = s"RModelFormula(${get(formula)})"
+
+  private def transformLabel(dataset: DataFrame): DataFrame = {
+    val responseName = parsedFormula.get.response
+    dataset.schema(responseName).dataType match {
+      case _: NumericType | BooleanType =>
+        dataset.select(
+          col("*"),
+          dataset(responseName).cast(DoubleType).as($(labelCol)))
+      case StringType =>
+        new StringIndexer()

It might be necessary to implement `RModelFormula` as an `Estimator`. 
Otherwise, this `StringIndexer()` will be invoked every time `transform` is 
called. If the input dataset is different, it would produce different answers. 
For this PR, how about removing support for string labels? In a follow-up PR, 
we can make `RModelFormula` an `Estimator`, whose `fit` returns an 
`RModelFormulaModel` ... (The name is awkward. Maybe we should call them 
`RFormula` and `RFormulaModel` instead.)
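
The concern can be sketched with a toy frequency-based label indexer in plain Scala. `LabelIndexer`, `LabelIndexModel`, and `EstimatorDemo` are hypothetical names for illustration; Spark's actual `Estimator`/`Model` API works on DataFrames, and the only property assumed here is that the index depends on label frequencies in the data seen at fit time.

```scala
// Hypothetical toy types; Spark's real Estimator and Model operate on DataFrames.
final case class LabelIndexModel(index: Map[String, Double]) {
  // Applies a fixed, already-fitted mapping.
  def transform(labels: Seq[String]): Seq[Double] = labels.map(index)
}

object LabelIndexer {
  // Most frequent label gets 0.0; ties are broken alphabetically.
  def fit(labels: Seq[String]): LabelIndexModel = {
    val ordered = labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
    LabelIndexModel(ordered.zipWithIndex.map { case (l, i) => l -> i.toDouble }.toMap)
  }
}

object EstimatorDemo {
  val train = Seq("a", "a", "b")
  val other = Seq("b", "b", "a")

  // Refitting inside every transform call: "a" is 0.0 for train but 1.0 for other.
  val refitTrain: Seq[Double] = LabelIndexer.fit(train).transform(train)
  val refitOther: Seq[Double] = LabelIndexer.fit(other).transform(other)

  // Fitting once and reusing the model keeps "a" at 0.0 for both inputs.
  val model: LabelIndexModel = LabelIndexer.fit(train)
  val stableOther: Seq[Double] = model.transform(other)
}
```

Separating `fit` from `transform` is what lets the same label-to-index mapping be applied to training data and to any later dataset.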



[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121099053
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121102812
  
  [Test build #37170 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37170/consoleFull)
 for   PR 7381 at commit 
[`5765ec6`](https://github.com/apache/spark/commit/5765ec6ace737049c91a1096f3e5c4670a2b19f2).





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121099480
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121107200
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121107144
  
  [Test build #37170 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37170/console)
 for   PR 7381 at commit 
[`5765ec6`](https://github.com/apache/spark/commit/5765ec6ace737049c91a1096f3e5c4670a2b19f2).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121102610
  
  [Test build #37167 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37167/consoleFull)
 for   PR 7381 at commit 
[`1f361b0`](https://github.com/apache/spark/commit/1f361b0e0f6a7de12a39bc1b75fd59f6a7128ab8).





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121102116
  
Merged build started.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread ericl
GitHub user ericl opened a pull request:

https://github.com/apache/spark/pull/7381

[SPARK-8774] [ML] Add R model formula with basic support as a transformer

This implements minimal R formula support as a feature transformer. Both 
numeric and string labels are supported, but features must be numeric for now.

cc @mengxr 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ericl/spark spark-8774-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7381.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7381


commit fb0826b875d8cda29dce6ec6654cdf0f66ac958f
Author: Eric Liang e...@databricks.com
Date:   2015-07-14T00:32:11Z

[SPARK-8774] Add R model formula with basic support as a transformer







[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121099302
  
  [Test build #37166 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37166/consoleFull)
 for   PR 7381 at commit 
[`fb0826b`](https://github.com/apache/spark/commit/fb0826b875d8cda29dce6ec6654cdf0f66ac958f).





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121099478
  
  [Test build #37166 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37166/console)
 for   PR 7381 at commit 
[`fb0826b`](https://github.com/apache/spark/commit/fb0826b875d8cda29dce6ec6654cdf0f66ac958f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121102765
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121102101
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121102778
  
Merged build started.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121099063
  
Merged build started.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121108982
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8774] [ML] Add R model formula with bas...

2015-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7381#issuecomment-121108947
  
  [Test build #37167 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37167/console)
 for   PR 7381 at commit 
[`1f361b0`](https://github.com/apache/spark/commit/1f361b0e0f6a7de12a39bc1b75fd59f6a7128ab8).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.

