[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125295030
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125295005
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125294308
  
test this please





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125300346
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35596332
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -114,25 +177,29 @@ class RFormula(override val uid: String)
     }
   }
 
-  private def transformFeatures: Transformer = {
-    // TODO(ekl) add support for non-numeric features and feature interactions
-    new VectorAssembler(uid)
-      .setInputCols(parsedFormula.get.terms.toArray)
-      .setOutputCol($(featuresCol))
-  }
-
   private def checkCanTransform(schema: StructType) {
-    require(parsedFormula.isDefined, "Must call setFormula() first.")
     val columnNames = schema.map(_.name)
     require(!columnNames.contains($(featuresCol)), "Features column already exists.")
     require(
       !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType,
       "Label column already exists and is not of type DoubleType.")
   }
+}
 
-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ * TODO(ekl) make this a public transformer
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
+  override val uid = Identifiable.randomUID("columnPruner")
+  override def transform(dataset: DataFrame): DataFrame = {
--- End diff --

Insert an empty line between method definitions.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35596324
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -62,19 +77,72 @@ class RFormula(override val uid: String)
   /** @group getParam */
   def getFormula: String = $(formula)
 
-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def fit(dataset: DataFrame): RFormulaModel = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[PipelineStage]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.get.terms.map { term =>
+      dataset.schema(term) match {
+        case column if column.dataType == StringType =>
+          val indexCol = term + "_idx_" + uid
+          val encodedCol = term + "_onehot_" + uid
+          encoderStages :+= new StringIndexer().setInputCol(term).setOutputCol(indexCol)
+          encoderStages :+= new OneHotEncoder().setInputCol(indexCol).setOutputCol(encodedCol)
+          tempColumns :+= indexCol
+          tempColumns :+= encodedCol
+          encodedCol
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    val pipelineModel = new Pipeline(uid).setStages(encoderStages.toArray).fit(dataset)
+    copyValues(new RFormulaModel(uid, parsedFormula.get, pipelineModel).setParent(this))
+  }
 
-  /** @group getParam */
-  def setLabelCol(value: String): this.type = set(labelCol, value)
+  // optimistic schema; does not contain any ML attributes
+  override def transformSchema(schema: StructType): StructType = {
+    if (hasLabelCol(schema)) {
+      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true))
+    } else {
+      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true) :+
+        StructField($(labelCol), DoubleType, true))
+    }
+  }
+
+  override def copy(extra: ParamMap): RFormula = defaultCopy(extra)
+
+  override def toString: String = s"RFormula(${get(formula)})"
+}
+
+/**
+ * A fitted RFormula. Fitting is required to determine the factor levels of formula terms.
+ * @param parsedFormula a pre-parsed R formula.
+ * @param pipelineModel the fitted feature model, including factor to index mappings.
+ */
+private[feature] class RFormulaModel(
--- End diff --

The class should be public because it appears in `RFormula.fit`, which is a 
public API. The constructor should be package private instead.
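
A minimal sketch of the visibility pattern suggested here, in plain Scala with no Spark dependency. The names `feature`, `FittedModel`, and `Estimator` are hypothetical stand-ins (an enclosing object plays the role of the `feature` package); this is not the code from the pull request.

```scala
// Hypothetical stand-in for a package: an enclosing object named `feature`.
object feature {
  // The class itself is public, so code outside `feature` can name the type.
  // Only the constructor is restricted to the enclosing scope.
  class FittedModel private[feature] (val uid: String) {
    def describe: String = s"FittedModel($uid)"
  }

  class Estimator(val uid: String) {
    // A public method may return FittedModel because the type is public;
    // construction still happens only inside `feature`.
    def fit(): FittedModel = new FittedModel(uid)
  }
}
```

Callers outside `feature` can hold and use a `feature.FittedModel` returned by `fit()`, but `new feature.FittedModel("x")` written outside the object would not compile.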





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35596320
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -62,19 +77,72 @@ class RFormula(override val uid: String)
   /** @group getParam */
   def getFormula: String = $(formula)
 
-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def fit(dataset: DataFrame): RFormulaModel = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[PipelineStage]()
--- End diff --

minor: `Seq` could be replaced by `ArrayBuffer` to avoid creating temp 
sequences. Then `:+=` below becomes `+=`, slightly simpler to read.
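
To illustrate the difference, here is a small self-contained sketch in plain Scala. The object name `StageCollection` and the string "stages" are assumptions made for the example; the real code appends `PipelineStage` instances.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical example: strings stand in for pipeline stages.
object StageCollection {
  // Immutable Seq held in a var: each `:+=` builds a new Seq and reassigns it.
  def withSeq(terms: Seq[String]): Seq[String] = {
    var stages = Seq[String]()
    terms.foreach { term => stages :+= s"index($term)" }
    stages
  }

  // ArrayBuffer held in a val: `+=` appends to the same buffer in place.
  def withBuffer(terms: Seq[String]): Seq[String] = {
    val stages = ArrayBuffer[String]()
    terms.foreach { term => stages += s"index($term)" }
    stages.toSeq
  }
}
```

Both variants produce the same elements in the same order; the buffer version just avoids the intermediate collections and the `var`.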





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125372454
  
Merged build started.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125372440
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125376053
  
[Test build #38602 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38602/consoleFull) for PR 7574 at commit [`f99131a`](https://github.com/apache/spark/commit/f99131ae1fcc5f84035cef20ad5d6231a38712d3).





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125375942
  
Merged build started.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125375931
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35598513
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
     val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      formula.fit(original)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      formula.fit(original)
     }
   }
 
   test("label column already exists") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
   test("label column already exists but is not double type") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
+    val model = formula.fit(original)
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      model.transformSchema(original.schema)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      model.transform(original)
     }
   }
 
   test("allow missing label column for test datasets") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
     assert(!resultSchema.exists(_.name == "label"))
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
-// TODO(ekl) enable after we implement string label support
-//  test("transform string label") {
-//    val formula = new RFormula().setFormula("name ~ id")
-//    val original = sqlContext.createDataFrame(
-//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
-//    val result = formula.transform(original)
-//    val resultSchema = formula.transformSchema(original.schema)
-//    val expected = sqlContext.createDataFrame(
-//      Seq(
-//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
-//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
-//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
-//      ).toDF("id", "name", "features", "label")
-//    assert(result.schema.toString == resultSchema.toString)
-//    assert(result.collect().toSeq == expected.collect().toSeq)
-//  }
+  test("encodes string terms") {
+    val formula = new RFormula().setFormula("id ~ a + b")
+    val original = sqlContext.createDataFrame(
+      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
+    val model = formula.fit(original)
+    val result = model.transform(original)
+    val resultSchema = model.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
+        (2, "bar", 4, Vectors.dense(Array(1.0, 0.0, 4.0)), 2.0),
+        (3, "bar", 5, Vectors.dense(Array(1.0, 0.0, 5.0)), 3.0),
+        (4, "baz", 5, Vectors.dense(Array(0.0, 0.0, 5.0)), 4.0))
+      ).toDF("id", "a", "b", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
+    assert(result.collect().toSeq == expected.collect().toSeq)
--- End diff --

Done
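
For readers following the `encodes string terms` test in the hunk above, the first two components of each expected feature vector are consistent with indexing labels by descending frequency and then one-hot encoding with the last category dropped. The sketch below reproduces that in plain Scala without Spark. The object and method names are hypothetical, and the tie-break by first appearance is an assumption of this sketch, not documented Spark behavior.

```scala
// Plain-Scala sketch, no Spark dependency; all names are illustrative.
object OneHotSketch {
  // Assign indices by descending frequency. Ties are broken by first
  // appearance in the input, which is an assumption of this sketch.
  def indexLabels(values: Seq[String]): Map[String, Int] = {
    val firstSeen = values.distinct.zipWithIndex.toMap
    values.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, firstSeen(label)) }
      .map { case (label, _) => label }
      .zipWithIndex
      .toMap
  }

  // One-hot encode an index into (numLabels - 1) slots, so the last
  // category is represented by an all-zero vector.
  def oneHot(index: Int, numLabels: Int): Seq[Double] =
    Seq.tabulate(numLabels - 1)(i => if (i == index) 1.0 else 0.0)

  // Encode every value of a column using the indices learned from that column.
  def encode(values: Seq[String]): Seq[Seq[Double]] = {
    val index = indexLabels(values)
    values.map(v => oneHot(index(v), index.size))
  }
}
```

Applied to the column `a` from the test (`foo`, `bar`, `bar`, `baz`), `bar` gets index 0, `foo` index 1, and `baz` index 2, which yields `(0.0, 1.0)`, `(1.0, 0.0)`, `(1.0, 0.0)`, and `(0.0, 0.0)`.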





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35598495
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -114,25 +177,29 @@ class RFormula(override val uid: String)
     }
   }
 
-  private def transformFeatures: Transformer = {
-    // TODO(ekl) add support for non-numeric features and feature interactions
-    new VectorAssembler(uid)
-      .setInputCols(parsedFormula.get.terms.toArray)
-      .setOutputCol($(featuresCol))
-  }
-
   private def checkCanTransform(schema: StructType) {
-    require(parsedFormula.isDefined, "Must call setFormula() first.")
     val columnNames = schema.map(_.name)
     require(!columnNames.contains($(featuresCol)), "Features column already exists.")
     require(
       !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType,
       "Label column already exists and is not of type DoubleType.")
   }
+}
 
-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ * TODO(ekl) make this a public transformer
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
+  override val uid = Identifiable.randomUID("columnPruner")
+  override def transform(dataset: DataFrame): DataFrame = {
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35598510
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
     val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      formula.fit(original)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      formula.fit(original)
     }
   }
 
   test("label column already exists") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
   test("label column already exists but is not double type") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
+    val model = formula.fit(original)
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      model.transformSchema(original.schema)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      model.transform(original)
     }
   }
 
   test("allow missing label column for test datasets") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
     assert(!resultSchema.exists(_.name == "label"))
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
-// TODO(ekl) enable after we implement string label support
-//  test("transform string label") {
-//    val formula = new RFormula().setFormula("name ~ id")
-//    val original = sqlContext.createDataFrame(
-//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
-//    val result = formula.transform(original)
-//    val resultSchema = formula.transformSchema(original.schema)
-//    val expected = sqlContext.createDataFrame(
-//      Seq(
-//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
-//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
-//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
-//      ).toDF("id", "name", "features", "label")
-//    assert(result.schema.toString == resultSchema.toString)
-//    assert(result.collect().toSeq == expected.collect().toSeq)
-//  }
+  test("encodes string terms") {
+    val formula = new RFormula().setFormula("id ~ a + b")
+    val original = sqlContext.createDataFrame(
+      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
+    val model = formula.fit(original)
+    val result = model.transform(original)
+    val resultSchema = model.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35598489
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -62,19 +77,72 @@ class RFormula(override val uid: String)
   /** @group getParam */
   def getFormula: String = $(formula)
 
-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def fit(dataset: DataFrame): RFormulaModel = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[PipelineStage]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.get.terms.map { term =>
+      dataset.schema(term) match {
+        case column if column.dataType == StringType =>
+          val indexCol = term + "_idx_" + uid
+          val encodedCol = term + "_onehot_" + uid
+          encoderStages :+= new StringIndexer().setInputCol(term).setOutputCol(indexCol)
+          encoderStages :+= new OneHotEncoder().setInputCol(indexCol).setOutputCol(encodedCol)
+          tempColumns :+= indexCol
+          tempColumns :+= encodedCol
+          encodedCol
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    val pipelineModel = new Pipeline(uid).setStages(encoderStages.toArray).fit(dataset)
+    copyValues(new RFormulaModel(uid, parsedFormula.get, pipelineModel).setParent(this))
+  }
 
-  /** @group getParam */
-  def setLabelCol(value: String): this.type = set(labelCol, value)
+  // optimistic schema; does not contain any ML attributes
+  override def transformSchema(schema: StructType): StructType = {
+    if (hasLabelCol(schema)) {
+      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true))
+    } else {
+      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true) :+
+        StructField($(labelCol), DoubleType, true))
+    }
+  }
+
+  override def copy(extra: ParamMap): RFormula = defaultCopy(extra)
+
+  override def toString: String = s"RFormula(${get(formula)})"
+}
+
+/**
+ * A fitted RFormula. Fitting is required to determine the factor levels of formula terms.
+ * @param parsedFormula a pre-parsed R formula.
+ * @param pipelineModel the fitted feature model, including factor to index mappings.
+ */
+private[feature] class RFormulaModel(
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125371041
  
test this please





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35596483
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
     val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      formula.fit(original)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      formula.fit(original)
     }
   }
 
   test("label column already exists") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
   test("label column already exists but is not double type") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
+    val model = formula.fit(original)
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      model.transformSchema(original.schema)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      model.transform(original)
     }
   }
 
   test("allow missing label column for test datasets") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
     assert(!resultSchema.exists(_.name == "label"))
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
-// TODO(ekl) enable after we implement string label support
-//  test("transform string label") {
-//    val formula = new RFormula().setFormula("name ~ id")
-//    val original = sqlContext.createDataFrame(
-//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
-//    val result = formula.transform(original)
-//    val resultSchema = formula.transformSchema(original.schema)
-//    val expected = sqlContext.createDataFrame(
-//      Seq(
-//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
-//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
-//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
-//      ).toDF("id", "name", "features", "label")
-//    assert(result.schema.toString == resultSchema.toString)
-//    assert(result.collect().toSeq == expected.collect().toSeq)
-//  }
+  test("encodes string terms") {
+    val formula = new RFormula().setFormula("id ~ a + b")
+    val original = sqlContext.createDataFrame(
+      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
--- End diff --

minor: move `).toDF(...)` to next line for readability





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35596486
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
     val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      formula.fit(original)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      formula.fit(original)
     }
   }
 
   test("label column already exists") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
   test("label column already exists but is not double type") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
+    val model = formula.fit(original)
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      model.transformSchema(original.schema)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      model.transform(original)
     }
   }
 
   test("allow missing label column for test datasets") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
     assert(!resultSchema.exists(_.name == "label"))
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
-// TODO(ekl) enable after we implement string label support
-//  test("transform string label") {
-//    val formula = new RFormula().setFormula("name ~ id")
-//    val original = sqlContext.createDataFrame(
-//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
-//    val result = formula.transform(original)
-//    val resultSchema = formula.transformSchema(original.schema)
-//    val expected = sqlContext.createDataFrame(
-//      Seq(
-//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
-//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
-//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
-//      ).toDF("id", "name", "features", "label")
-//    assert(result.schema.toString == resultSchema.toString)
-//    assert(result.collect().toSeq == expected.collect().toSeq)
-//  }
+  test("encodes string terms") {
+    val formula = new RFormula().setFormula("id ~ a + b")
+    val original = sqlContext.createDataFrame(
+      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
+    val model = formula.fit(original)
+    val result = model.transform(original)
+    val resultSchema = model.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
--- End diff --

`Array(...)` is not necessary. `Vectors.dense` takes varargs.
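
To illustrate the varargs point: Spark's `Vectors.dense` has one overload that takes an `Array[Double]` and another that takes a first value followed by varargs, so the `Array(...)` wrapper can be dropped. The sketch below is a minimal, Spark-free illustration. `DenseSketch` and its `dense` methods are hypothetical local stand-ins that only mirror the shape of those two overloads; they are not the library code.

```scala
// Hypothetical stand-in so the sketch runs without Spark on the classpath.
// It mirrors the two overload shapes of Vectors.dense, nothing more.
object DenseSketch {
  // Overload taking an explicit array of values.
  def dense(values: Array[Double]): IndexedSeq[Double] = values.toIndexedSeq

  // Varargs overload: one required value followed by any number of others.
  def dense(first: Double, rest: Double*): IndexedSeq[Double] =
    (first +: rest).toIndexedSeq

  def main(args: Array[String]): Unit = {
    val viaArray = dense(Array(0.0, 1.0, 4.0))
    val viaVarargs = dense(0.0, 1.0, 4.0)
    // Both spellings build the same three-element sequence.
    assert(viaArray == viaVarargs)
    println(viaVarargs)
  }
}
```

With the real API the expected row would then read `Vectors.dense(0.0, 1.0, 4.0)` instead of `Vectors.dense(Array(0.0, 1.0, 4.0))`.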





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35596488
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
     val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      formula.fit(original)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      formula.fit(original)
     }
   }
 
   test("label column already exists") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
   test("label column already exists but is not double type") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
+    val model = formula.fit(original)
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      model.transformSchema(original.schema)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      model.transform(original)
     }
   }
 
   test("allow missing label column for test datasets") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
     assert(!resultSchema.exists(_.name == "label"))
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
-// TODO(ekl) enable after we implement string label support
-//  test("transform string label") {
-//    val formula = new RFormula().setFormula("name ~ id")
-//    val original = sqlContext.createDataFrame(
-//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
-//    val result = formula.transform(original)
-//    val resultSchema = formula.transformSchema(original.schema)
-//    val expected = sqlContext.createDataFrame(
-//      Seq(
-//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
-//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
-//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
-//      ).toDF("id", "name", "features", "label")
-//    assert(result.schema.toString == resultSchema.toString)
-//    assert(result.collect().toSeq == expected.collect().toSeq)
-//  }
+  test("encodes string terms") {
+    val formula = new RFormula().setFormula("id ~ a + b")
+    val original = sqlContext.createDataFrame(
+      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
+    val model = formula.fit(original)
+    val result = model.transform(original)
+    val resultSchema = model.transformSchema(original.schema)
+    val expected = sqlContext.createDataFrame(
+      Seq(
+        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
+        (2, "bar", 4, Vectors.dense(Array(1.0, 0.0, 4.0)), 2.0),
+        (3, "bar", 5, Vectors.dense(Array(1.0, 0.0, 5.0)), 3.0),
+        (4, "baz", 5, Vectors.dense(Array(0.0, 0.0, 5.0)), 4.0))
+      ).toDF("id", "a", "b", "features", "label")
+    assert(result.schema.toString == resultSchema.toString)
+    assert(result.collect().toSeq == expected.collect().toSeq)
--- End diff --

minor: Again, if you use `===` instead of `==`, we can remove `toSeq`.
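
Some background on why `toSeq` is there at all: in plain Scala, `==` on two arrays compares references, so the test converts both sides to `Seq` to get an element-wise comparison. ScalaTest's `===` compares arrays structurally, which is what makes the conversion unnecessary. A minimal stdlib-only sketch of the underlying behaviour follows; the object name `ArrayEqualitySketch` is made up for illustration, and plain `Int` arrays stand in for the `Array[Row]` returned by `collect()`.

```scala
object ArrayEqualitySketch {
  def main(args: Array[String]): Unit = {
    // Two distinct arrays with identical contents.
    val result = Array(1, 2, 3)
    val expected = Array(1, 2, 3)

    // `==` on arrays is reference equality, so distinct arrays never match.
    assert(!(result == expected))
    // Converting to Seq gives element-wise comparison, hence the toSeq calls.
    assert(result.toSeq == expected.toSeq)
    // sameElements is another stdlib way to compare contents.
    assert(result.sameElements(expected))
  }
}
```

Inside a ScalaTest suite the assertion could then be written as `assert(result.collect() === expected.collect())`.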



[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35598503
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
     val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      formula.fit(original)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      formula.fit(original)
     }
   }
 
   test("label column already exists") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
   test("label column already exists but is not double type") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
     val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
+    val model = formula.fit(original)
     intercept[IllegalArgumentException] {
-      formula.transformSchema(original.schema)
+      model.transformSchema(original.schema)
     }
     intercept[IllegalArgumentException] {
-      formula.transform(original)
+      model.transform(original)
     }
   }
 
   test("allow missing label column for test datasets") {
     val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
     val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
-    val resultSchema = formula.transformSchema(original.schema)
+    val model = formula.fit(original)
+    val resultSchema = model.transformSchema(original.schema)
     assert(resultSchema.length == 3)
     assert(!resultSchema.exists(_.name == "label"))
-    assert(resultSchema.toString == formula.transform(original).schema.toString)
+    assert(resultSchema.toString == model.transform(original).schema.toString)
   }
 
-// TODO(ekl) enable after we implement string label support
-//  test("transform string label") {
-//    val formula = new RFormula().setFormula("name ~ id")
-//    val original = sqlContext.createDataFrame(
-//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
-//    val result = formula.transform(original)
-//    val resultSchema = formula.transformSchema(original.schema)
-//    val expected = sqlContext.createDataFrame(
-//      Seq(
-//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
-//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
-//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
-//      ).toDF("id", "name", "features", "label")
-//    assert(result.schema.toString == resultSchema.toString)
-//    assert(result.collect().toSeq == expected.collect().toSeq)
-//  }
+  test("encodes string terms") {
+    val formula = new RFormula().setFormula("id ~ a + b")
+    val original = sqlContext.createDataFrame(
+      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35598479
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -62,19 +77,72 @@ class RFormula(override val uid: String)
   /** @group getParam */
   def getFormula: String = $(formula)
 
-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def fit(dataset: DataFrame): RFormulaModel = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[PipelineStage]()
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125373141
  
  [Test build #38597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38597/consoleFull) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125376595
  
LGTM pending Jenkins.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125396509
  
  [Test build #38597 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38597/console) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`






[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125396596
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125380508
  
  [Test build #38602 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38602/console) for PR 7574 at commit [`f99131a`](https://github.com/apache/spark/commit/f99131ae1fcc5f84035cef20ad5d6231a38712d3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`






[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125384454
  
Merged into master. Thanks!





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7574





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-125380557
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-25 Thread ericl
Github user ericl commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124916428
  
ptal





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35397462
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -130,9 +173,52 @@ class RFormula(override val uid: String)
       "Label column already exists and is not of type DoubleType.")
   }
 
-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+  private def featureTransformer(schema: StructType): Transformer = {
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[Transformer]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.terms.map { term =>
+      schema(term) match {
+        case column if column.dataType == StringType =>
+          val encodedTerm = term + "_onehot_" + uid
+          val indexer = factorLevels(term)
+          val indexCol = indexer.getOrDefault(indexer.outputCol)
+          encoderStages :+= indexer
+          encoderStages :+= new OneHotEncoder()
+            .setInputCol(indexCol)
+            .setOutputCol(encodedTerm)
+          tempColumns :+= encodedTerm
+          tempColumns :+= indexCol
+          encodedTerm
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    new PipelineModel(uid, encoderStages.toArray)
+  }
+}
+
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35397464
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -130,9 +173,52 @@ class RFormula(override val uid: String)
       "Label column already exists and is not of type DoubleType.")
   }
 
-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+  private def featureTransformer(schema: StructType): Transformer = {
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[Transformer]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.terms.map { term =>
+      schema(term) match {
+        case column if column.dataType == StringType =>
+          val encodedTerm = term + "_onehot_" + uid
+          val indexer = factorLevels(term)
+          val indexCol = indexer.getOrDefault(indexer.outputCol)
+          encoderStages :+= indexer
+          encoderStages :+= new OneHotEncoder()
+            .setInputCol(indexCol)
+            .setOutputCol(encodedTerm)
+          tempColumns :+= encodedTerm
+          tempColumns :+= indexCol
+          encodedTerm
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    new PipelineModel(uid, encoderStages.toArray)
+  }
+}
+
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
+  override val uid = Identifiable.randomUID("columnPruner")
+  override def transform(dataset: DataFrame): DataFrame = {
+    var res: DataFrame = dataset
+    for (column <- columnsToPrune) {
+      res = res.drop(column)
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35397461
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -62,19 +77,60 @@ class RFormula(override val uid: String)
   /** @group getParam */
   def getFormula: String = $(formula)
 
-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def fit(dataset: DataFrame): RFormulaModel = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val factorLevels = parsedFormula.get.terms.flatMap { term =>
+      dataset.schema(term) match {
+        case column if column.dataType == StringType =>
+          val idxTerm = term + "_idx_" + uid
+          val indexer = new StringIndexer().setInputCol(term).setOutputCol(idxTerm)
+          Some(term -> indexer.fit(dataset))
+        case _ =>
+          None
--- End diff --

Done





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124346736
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124346148
  
  [Test build #38316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38316/console) for PR 7574 at commit [`c302a2c`](https://github.com/apache/spark/commit/c302a2c40088de89feb37964f182de33279df818).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`






[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124338457
  
  [Test build #38316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38316/consoleFull) for PR 7574 at commit [`c302a2c`](https://github.com/apache/spark/commit/c302a2c40088de89feb37964f182de33279df818).





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124769768
  
Merged build started.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124769744
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124770734
  
  [Test build #38410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38410/consoleFull) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124776979
  
  [Test build #38410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38410/console) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`






[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124777072
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124338057
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-124338078
  
Merged build started.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-23 Thread ericl
Github user ericl commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123988222
  
Hmm, I guess that is pretty harmless though. Will do.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-23 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123994687
  
You can construct a `Pipeline` object in `RFormula.fit`, which contains all 
`StringIndexer`, `OneHotEncoder`, etc. Then call `Pipeline.fit` in 
`RFormula.fit` and get the fitted `PipelineModel`. Pass it to `RFormulaModel`. 
`RFormulaModel` becomes a simple wrapper over the fitted pipeline.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-22 Thread ericl
Github user ericl commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123961633
  
@mengxr to clarify, not calling `StringIndexer.fit` in `RFormula.fit` means 
RFormulaModel will have a reference to the original fitted dataset, correct?





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35279252
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -62,19 +77,60 @@ class RFormula(override val uid: String)
   /** @group getParam */
   def getFormula: String = $(formula)
 
-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def fit(dataset: DataFrame): RFormulaModel = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val factorLevels = parsedFormula.get.terms.flatMap { term =>
+      dataset.schema(term) match {
+        case column if column.dataType == StringType =>
+          val idxTerm = term + "_idx_" + uid
+          val indexer = new StringIndexer().setInputCol(term).setOutputCol(idxTerm)
+          Some(term -> indexer.fit(dataset))
+        case _ =>
+          None
--- End diff --

It might be simpler to construct the entire preprocessing pipeline in 
`fit`, which includes `StringIndexer`s, `OneHotEncoder`, and `VectorAssembler`. 
Then call `fit` on the pipeline and pass the `PipelineModel` to 
`RFormulaModel`. We might add `StringVectorizer` to combine `StringIndexer` and 
`OneHotEncoder` in the future.

I'm a little worried about the generated feature names. But we could 
address this issue separately.
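
A minimal sketch of the suggestion above, assuming the spark.ml `Pipeline`, `StringIndexer`, `OneHotEncoder`, and `VectorAssembler` APIs. The object name, the `fitPreprocessing` helper, and the `_idx` / `_onehot` column suffixes are hypothetical and not taken from the PR; exact signatures can differ between Spark versions (for example, `OneHotEncoder` became an `Estimator` in later releases, which still fits inside a `Pipeline`).

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StringType

object RFormulaPipelineSketch {
  // Hypothetical helper: build every preprocessing stage up front and let
  // Pipeline.fit decide how to fit them, instead of calling
  // StringIndexer.fit explicitly for each term.
  def fitPreprocessing(
      dataset: DataFrame,
      terms: Seq[String],
      featuresCol: String): PipelineModel = {
    // For each term, collect the stages it needs and the column that
    // should finally be assembled into the feature vector.
    val perTerm: Seq[(Seq[PipelineStage], String)] = terms.map { term =>
      dataset.schema(term).dataType match {
        case StringType =>
          val indexCol = term + "_idx"       // assumed naming scheme
          val encodedCol = term + "_onehot"  // assumed naming scheme
          val indexer = new StringIndexer().setInputCol(term).setOutputCol(indexCol)
          val encoder = new OneHotEncoder().setInputCol(indexCol).setOutputCol(encodedCol)
          (Seq[PipelineStage](indexer, encoder), encodedCol)
        case _ =>
          (Seq.empty[PipelineStage], term)
      }
    }
    val assembler = new VectorAssembler()
      .setInputCols(perTerm.map(_._2).toArray)
      .setOutputCol(featuresCol)
    val stages: Array[PipelineStage] = (perTerm.flatMap(_._1) :+ assembler).toArray
    // The fitted PipelineModel is what a wrapper model would hold on to.
    new Pipeline().setStages(stages).fit(dataset)
  }
}
```

With this shape, a model class only needs to delegate `transform` to the returned `PipelineModel`; the temporary index and one-hot columns would still have to be pruned afterwards.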





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35279311
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -130,9 +173,52 @@ class RFormula(override val uid: String)
       "Label column already exists and is not of type DoubleType.")
   }
 
-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+  private def featureTransformer(schema: StructType): Transformer = {
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[Transformer]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.terms.map { term =>
+      schema(term) match {
+        case column if column.dataType == StringType =>
+          val encodedTerm = term + "_onehot_" + uid
+          val indexer = factorLevels(term)
+          val indexCol = indexer.getOrDefault(indexer.outputCol)
+          encoderStages :+= indexer
+          encoderStages :+= new OneHotEncoder()
+            .setInputCol(indexCol)
+            .setOutputCol(encodedTerm)
+          tempColumns :+= encodedTerm
+          tempColumns :+= indexCol
+          encodedTerm
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    new PipelineModel(uid, encoderStages.toArray)
+  }
+}
+
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
--- End diff --
--- End diff --

Leave a TODO note to make this a public transformer.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-22 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123913341
  
@ericl I think it is simpler to construct a `pipeline` in `RFormula.fit` 
without calling `StringIndexer.fit` explicitly. That leaves space for 
`pipeline.fit` optimization. Then `RFormulaModel` takes the `PipelineModel` 
object directly, which does most of the job.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7574#discussion_r35279570
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -130,9 +173,52 @@ class RFormula(override val uid: String)
       "Label column already exists and is not of type DoubleType.")
   }
 
-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+  private def featureTransformer(schema: StructType): Transformer = {
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[Transformer]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.terms.map { term =>
+      schema(term) match {
+        case column if column.dataType == StringType =>
+          val encodedTerm = term + "_onehot_" + uid
+          val indexer = factorLevels(term)
+          val indexCol = indexer.getOrDefault(indexer.outputCol)
+          encoderStages :+= indexer
+          encoderStages :+= new OneHotEncoder()
+            .setInputCol(indexCol)
+            .setOutputCol(encodedTerm)
+          tempColumns :+= encodedTerm
+          tempColumns :+= indexCol
+          encodedTerm
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    new PipelineModel(uid, encoderStages.toArray)
+  }
+}
+
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
+  override val uid = Identifiable.randomUID("columnPruner")
+  override def transform(dataset: DataFrame): DataFrame = {
+    var res: DataFrame = dataset
+    for (column <- columnsToPrune) {
+      res = res.drop(column)
--- End diff --

Calling `drop` one by one might increase the stack size. We can get output 
columns by `dataset.columns.toSet -- columnsToPrune` and then call `select` 
directly.
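
As a rough illustration of the single-`select` approach, the column computation can be sketched in plain Scala without Spark. The object and the `keptColumns` helper are hypothetical names, not part of the PR. Note that this sketch filters the ordered column list rather than using `dataset.columns.toSet -- columnsToPrune`, because a `Set` does not guarantee iteration order and the surviving columns should presumably keep their original positions.

```scala
object ColumnPruningSketch {
  // Hypothetical helper: compute the surviving columns in one pass,
  // preserving the original column order.
  def keptColumns(columns: Seq[String], columnsToPrune: Set[String]): Seq[String] =
    columns.filterNot(columnsToPrune.contains)

  def main(args: Array[String]): Unit = {
    val columns = Seq("id", "country", "country_idx", "country_onehot", "features")
    val kept = keptColumns(columns, Set("country_idx", "country_onehot"))
    println(kept.mkString(", ")) // prints "id, country, features"
    // With a DataFrame, the result would feed one projection instead of
    // a chain of drop calls, e.g.:
    //   dataset.select(kept.map(dataset.col): _*)
  }
}
```

A single projection adds one node to the query plan, whereas each `drop` call wraps the previous plan in another layer.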





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123488315
  
  [Test build #37982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37982/console) for PR 7574 at commit [`4d79193`](https://github.com/apache/spark/commit/4d79193d478aeca8fae0f31c15808d6dccb40718).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123488427
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123475216
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123475106
  
  [Test build #37977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37977/console) for PR 7574 at commit [`169a085`](https://github.com/apache/spark/commit/169a0850fc40964194e48c4b317b74226a542cd5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123479213
  
Merged build started.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123479155
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123467145
  
  [Test build #37977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37977/consoleFull) for PR 7574 at commit [`169a085`](https://github.com/apache/spark/commit/169a0850fc40964194e48c4b317b74226a542cd5).





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123465703
  
Merged build started.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123465662
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7574#issuecomment-123480884
  
  [Test build #37982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37982/consoleFull) for PR 7574 at commit [`4d79193`](https://github.com/apache/spark/commit/4d79193d478aeca8fae0f31c15808d6dccb40718).





[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...

2015-07-21 Thread ericl
GitHub user ericl opened a pull request:

https://github.com/apache/spark/pull/7574

[SPARK-9230] [ML] Support StringType features in RFormula

This adds StringType feature support via OneHotEncoder. As part of this 
task it was necessary to change RFormula to an Estimator, so that factor levels 
could be determined from the training dataset.
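
To illustrate why an Estimator is needed here, the following Spark-free sketch shows factor levels being learned from training values before any one-hot encoding can happen. All names are hypothetical. The frequency ordering mirrors how `StringIndexer` assigns index 0 to the most frequent label; the alphabetical tie-break is an assumption of this sketch, and unlike Spark's `OneHotEncoder` defaults it keeps a slot for every level.

```scala
object FactorLevelsSketch {
  // "fit" step: learn the level -> index mapping from training values.
  // Most frequent level gets index 0; ties are broken alphabetically
  // (an assumption made for this sketch).
  def fitLevels(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .sortBy { case (level, occurrences) => (-occurrences.size, level) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // "transform" step: encode a value using the levels learned earlier.
  // Unseen values map to an all-zero vector in this sketch.
  def oneHot(levels: Map[String, Int], value: String): Vector[Double] =
    Vector.tabulate(levels.size)(i => if (levels.get(value).contains(i)) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val levels = fitLevels(Seq("b", "a", "b", "c"))
    println(oneHot(levels, "a")) // prints "Vector(0.0, 1.0, 0.0)"
  }
}
```

The width of the encoded vector depends on how many distinct levels the training data contains, which is information a stateless Transformer does not have.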

Not sure if I am using uids correctly here, would be good to get reviewer 
help on that.
cc @mengxr 

Umbrella design doc: 
https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ericl/spark string-features

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7574.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7574


commit a1d03f44f7e226198bde129cc0f40827761bff17
Author: Eric Liang e...@databricks.com
Date:   2015-07-20T22:25:55Z

refactor into estimator

commit 8a637db882175161ef17dce0795cf1576b594f20
Author: Eric Liang e...@databricks.com
Date:   2015-07-20T23:40:20Z

encoder wip

commit b01c7c5c90efac1d3470b2c463fddf91fbf67408
Author: Eric Liang e...@databricks.com
Date:   2015-07-21T00:53:11Z

add test

commit 5b2c4a2d8c29065a232aa207deaa6e869e545131
Author: Eric Liang e...@databricks.com
Date:   2015-07-21T01:45:33Z

Mon Jul 20 18:45:33 PDT 2015

commit d841cec4f42cef5dbda3d43e036964ae63fd71c9
Author: Eric Liang e...@databricks.com
Date:   2015-07-21T17:49:29Z

Merge branch 'master' into string-features

Conflicts:
mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala

commit 72bd6f333dd118a900338917213bb8e75144c6e7
Author: Eric Liang e...@databricks.com
Date:   2015-07-21T19:22:57Z

fix merge

commit a230a4790c5163d337781fb9f50cca8a7f83a8b1
Author: Eric Liang e...@databricks.com
Date:   2015-07-21T19:49:03Z

Merge branch 'master' into string-features

commit 169a0850fc40964194e48c4b317b74226a542cd5
Author: Eric Liang e...@databricks.com
Date:   2015-07-21T20:08:48Z

tweak functional test



