[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125295030 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125295005 Merged build triggered.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125294308 test this please
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125300346 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35596332

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -114,25 +177,29 @@ class RFormula(override val uid: String)
         }
       }

    -  private def transformFeatures: Transformer = {
    -    // TODO(ekl) add support for non-numeric features and feature interactions
    -    new VectorAssembler(uid)
    -      .setInputCols(parsedFormula.get.terms.toArray)
    -      .setOutputCol($(featuresCol))
    -  }
    -
       private def checkCanTransform(schema: StructType) {
    -    require(parsedFormula.isDefined, "Must call setFormula() first.")
         val columnNames = schema.map(_.name)
         require(!columnNames.contains($(featuresCol)), "Features column already exists.")
         require(
           !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType,
           "Label column already exists and is not of type DoubleType.")
       }
    +}

    -  private def hasLabelCol(schema: StructType): Boolean = {
    -    schema.map(_.name).contains($(labelCol))
    +/**
    + * Utility transformer for removing temporary columns from a DataFrame.
    + * TODO(ekl) make this a public transformer
    + */
    +private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
    +  override val uid = Identifiable.randomUID("columnPruner")
    +  override def transform(dataset: DataFrame): DataFrame = {
    --- End diff --

    Insert an empty line between method definitions.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35596324

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -62,19 +77,72 @@ class RFormula(override val uid: String)
       /** @group getParam */
       def getFormula: String = $(formula)

    -  /** @group getParam */
    -  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +  override def fit(dataset: DataFrame): RFormulaModel = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    // StringType terms and terms representing interactions need to be encoded before assembly.
    +    // TODO(ekl) add support for feature interactions
    +    var encoderStages = Seq[PipelineStage]()
    +    var tempColumns = Seq[String]()
    +    val encodedTerms = parsedFormula.get.terms.map { term =>
    +      dataset.schema(term) match {
    +        case column if column.dataType == StringType =>
    +          val indexCol = term + "_idx_" + uid
    +          val encodedCol = term + "_onehot_" + uid
    +          encoderStages :+= new StringIndexer().setInputCol(term).setOutputCol(indexCol)
    +          encoderStages :+= new OneHotEncoder().setInputCol(indexCol).setOutputCol(encodedCol)
    +          tempColumns :+= indexCol
    +          tempColumns :+= encodedCol
    +          encodedCol
    +        case _ =>
    +          term
    +      }
    +    }
    +    encoderStages :+= new VectorAssembler(uid)
    +      .setInputCols(encodedTerms.toArray)
    +      .setOutputCol($(featuresCol))
    +    encoderStages :+= new ColumnPruner(tempColumns.toSet)
    +    val pipelineModel = new Pipeline(uid).setStages(encoderStages.toArray).fit(dataset)
    +    copyValues(new RFormulaModel(uid, parsedFormula.get, pipelineModel).setParent(this))
    +  }

    -  /** @group getParam */
    -  def setLabelCol(value: String): this.type = set(labelCol, value)
    +  // optimistic schema; does not contain any ML attributes
    +  override def transformSchema(schema: StructType): StructType = {
    +    if (hasLabelCol(schema)) {
    +      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true))
    +    } else {
    +      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true) :+
    +        StructField($(labelCol), DoubleType, true))
    +    }
    +  }
    +
    +  override def copy(extra: ParamMap): RFormula = defaultCopy(extra)
    +
    +  override def toString: String = s"RFormula(${get(formula)})"
    +}
    +
    +/**
    + * A fitted RFormula. Fitting is required to determine the factor levels of formula terms.
    + * @param parsedFormula a pre-parsed R formula.
    + * @param pipelineModel the fitted feature model, including factor to index mappings.
    + */
    +private[feature] class RFormulaModel(
    --- End diff --

    The class should be public because it appears in `RFormula.fit`, which is a public API. The constructor should be package private instead.
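The visibility suggestion in the comment above (public class, package-private constructor) can be illustrated with a minimal, Spark-free sketch. Everything here is hypothetical: the object `feature` stands in for the enclosing package, and `Model` and `Estimator` are placeholder names, not the classes in the pull request.

```scala
// Hypothetical stand-in: the object `feature` plays the role of the enclosing package.
object feature {
  // Public class with a package-private constructor: code outside `feature`
  // can hold and use a Model, but cannot construct one directly.
  class Model private[feature] (val uid: String, val levels: Seq[String])

  class Estimator(val uid: String) {
    // Public method, so its return type has to be visible to callers.
    def fit(data: Seq[String]): Model = new Model(uid, data.distinct)
  }
}

object VisibilityDemo {
  // Outside `feature`, a Model can only be obtained through the public API.
  def fitted: feature.Model =
    new feature.Estimator("rf_1").fit(Seq("foo", "bar", "bar"))

  // new feature.Model("x", Nil) // would not compile here: constructor is private[feature]
}
```

With this arrangement the return type of `fit` stays usable by callers, while construction remains under the control of the enclosing scope.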
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35596320

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -62,19 +77,72 @@ class RFormula(override val uid: String)
       /** @group getParam */
       def getFormula: String = $(formula)

    -  /** @group getParam */
    -  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +  override def fit(dataset: DataFrame): RFormulaModel = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    // StringType terms and terms representing interactions need to be encoded before assembly.
    +    // TODO(ekl) add support for feature interactions
    +    var encoderStages = Seq[PipelineStage]()
    --- End diff --

    minor: `Seq` could be replaced by `ArrayBuffer` to avoid creating temp sequences. Then `:+=` below becomes `+=`, slightly simpler to read.
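A small sketch of the two accumulation styles the comment above contrasts. Strings stand in for pipeline stages, and `StageAccumulation`, `withSeq`, and `withBuffer` are illustrative names only, not code from the pull request.

```scala
import scala.collection.mutable.ArrayBuffer

object StageAccumulation {
  // Variant 1: an immutable Seq held in a var. Each `:+=` expands to
  // `stages = stages :+ x`, which builds a new sequence.
  def withSeq(terms: Seq[String]): Seq[String] = {
    var stages = Seq[String]()
    terms.foreach { t =>
      stages :+= s"index($t)"
      stages :+= s"onehot($t)"
    }
    stages
  }

  // Variant 2: one mutable buffer, appended in place with `+=`.
  def withBuffer(terms: Seq[String]): Seq[String] = {
    val stages = ArrayBuffer[String]()
    terms.foreach { t =>
      stages += s"index($t)"
      stages += s"onehot($t)"
    }
    stages.toList
  }
}
```

Both variants produce the same elements in the same order; the buffer version avoids the intermediate sequences and needs no `var`.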
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125372454 Merged build started.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125372440 Merged build triggered.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125376053 [Test build #38602 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38602/consoleFull) for PR 7574 at commit [`f99131a`](https://github.com/apache/spark/commit/f99131ae1fcc5f84035cef20ad5d6231a38712d3).
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125375942 Merged build started.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125375931 Merged build triggered.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35598513

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
         val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      formula.fit(original)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      formula.fit(original)
         }
       }

       test("label column already exists") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

       test("label column already exists but is not double type") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
    +    val model = formula.fit(original)
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      model.transformSchema(original.schema)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      model.transform(original)
         }
       }

       test("allow missing label column for test datasets") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
         assert(!resultSchema.exists(_.name == "label"))
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

    -// TODO(ekl) enable after we implement string label support
    -// test("transform string label") {
    -//    val formula = new RFormula().setFormula("name ~ id")
    -//    val original = sqlContext.createDataFrame(
    -//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
    -//    val result = formula.transform(original)
    -//    val resultSchema = formula.transformSchema(original.schema)
    -//    val expected = sqlContext.createDataFrame(
    -//      Seq(
    -//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
    -//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
    -//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
    -//      ).toDF("id", "name", "features", "label")
    -//    assert(result.schema.toString == resultSchema.toString)
    -//    assert(result.collect().toSeq == expected.collect().toSeq)
    -// }
    +  test("encodes string terms") {
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +    val original = sqlContext.createDataFrame(
    +      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
    +    val model = formula.fit(original)
    +    val result = model.transform(original)
    +    val resultSchema = model.transformSchema(original.schema)
    +    val expected = sqlContext.createDataFrame(
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
    +        (2, "bar", 4, Vectors.dense(Array(1.0, 0.0, 4.0)), 2.0),
    +        (3, "bar", 5, Vectors.dense(Array(1.0, 0.0, 5.0)), 3.0),
    +        (4, "baz", 5, Vectors.dense(Array(0.0, 0.0, 5.0)), 4.0))
    +      ).toDF("id", "a", "b", "features", "label")
    +    assert(result.schema.toString == resultSchema.toString)
    +    assert(result.collect().toSeq == expected.collect().toSeq)
    --- End diff --

    Done
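The expected vectors in the quoted "encodes string terms" test are consistent with indexing the string column by descending label frequency and then one-hot encoding with the last category dropped. The following Spark-free sketch reproduces those vectors under that reading. The tie-breaking rule (first appearance) is an assumption made here for determinism, and `OneHotSketch` and its helpers are hypothetical names rather than Spark APIs.

```scala
object OneHotSketch {
  // Index labels by descending frequency. Ties are broken by first appearance,
  // which is an assumption of this sketch, not a documented Spark guarantee.
  def indexLabels(labels: Seq[String]): Map[String, Int] = {
    val firstSeen = labels.distinct.zipWithIndex.toMap
    val counts = labels.groupBy(identity).map { case (k, v) => k -> v.size }
    labels.distinct
      .sortBy(l => (-counts(l), firstSeen(l)))
      .zipWithIndex
      .toMap
  }

  // One-hot encode an index, dropping the last category so it maps to all zeros.
  def oneHotDropLast(index: Int, numLabels: Int): Seq[Double] =
    Seq.tabulate(numLabels - 1)(i => if (i == index) 1.0 else 0.0)

  // Encode string column `a`, then append numeric column `b`,
  // in the way an assembler stage would concatenate them.
  def features(a: Seq[String], b: Seq[Double]): Seq[Seq[Double]] = {
    val idx = indexLabels(a)
    a.zip(b).map { case (label, x) => oneHotDropLast(idx(label), idx.size) :+ x }
  }
}
```

For the test's data, `bar` is the most frequent label and receives index 0, while `baz` lands in the dropped last category and encodes as all zeros.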
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35598495

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -114,25 +177,29 @@ class RFormula(override val uid: String)
         }
       }

    -  private def transformFeatures: Transformer = {
    -    // TODO(ekl) add support for non-numeric features and feature interactions
    -    new VectorAssembler(uid)
    -      .setInputCols(parsedFormula.get.terms.toArray)
    -      .setOutputCol($(featuresCol))
    -  }
    -
       private def checkCanTransform(schema: StructType) {
    -    require(parsedFormula.isDefined, "Must call setFormula() first.")
         val columnNames = schema.map(_.name)
         require(!columnNames.contains($(featuresCol)), "Features column already exists.")
         require(
           !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType,
           "Label column already exists and is not of type DoubleType.")
       }
    +}

    -  private def hasLabelCol(schema: StructType): Boolean = {
    -    schema.map(_.name).contains($(labelCol))
    +/**
    + * Utility transformer for removing temporary columns from a DataFrame.
    + * TODO(ekl) make this a public transformer
    + */
    +private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
    +  override val uid = Identifiable.randomUID("columnPruner")
    +  override def transform(dataset: DataFrame): DataFrame = {
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35598510

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
         val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      formula.fit(original)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      formula.fit(original)
         }
       }

       test("label column already exists") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

       test("label column already exists but is not double type") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
    +    val model = formula.fit(original)
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      model.transformSchema(original.schema)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      model.transform(original)
         }
       }

       test("allow missing label column for test datasets") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
         assert(!resultSchema.exists(_.name == "label"))
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

    -// TODO(ekl) enable after we implement string label support
    -// test("transform string label") {
    -//    val formula = new RFormula().setFormula("name ~ id")
    -//    val original = sqlContext.createDataFrame(
    -//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
    -//    val result = formula.transform(original)
    -//    val resultSchema = formula.transformSchema(original.schema)
    -//    val expected = sqlContext.createDataFrame(
    -//      Seq(
    -//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
    -//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
    -//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
    -//      ).toDF("id", "name", "features", "label")
    -//    assert(result.schema.toString == resultSchema.toString)
    -//    assert(result.collect().toSeq == expected.collect().toSeq)
    -// }
    +  test("encodes string terms") {
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +    val original = sqlContext.createDataFrame(
    +      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
    +    val model = formula.fit(original)
    +    val result = model.transform(original)
    +    val resultSchema = model.transformSchema(original.schema)
    +    val expected = sqlContext.createDataFrame(
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35598489

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -62,19 +77,72 @@ class RFormula(override val uid: String)
       /** @group getParam */
       def getFormula: String = $(formula)

    -  /** @group getParam */
    -  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +  override def fit(dataset: DataFrame): RFormulaModel = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    // StringType terms and terms representing interactions need to be encoded before assembly.
    +    // TODO(ekl) add support for feature interactions
    +    var encoderStages = Seq[PipelineStage]()
    +    var tempColumns = Seq[String]()
    +    val encodedTerms = parsedFormula.get.terms.map { term =>
    +      dataset.schema(term) match {
    +        case column if column.dataType == StringType =>
    +          val indexCol = term + "_idx_" + uid
    +          val encodedCol = term + "_onehot_" + uid
    +          encoderStages :+= new StringIndexer().setInputCol(term).setOutputCol(indexCol)
    +          encoderStages :+= new OneHotEncoder().setInputCol(indexCol).setOutputCol(encodedCol)
    +          tempColumns :+= indexCol
    +          tempColumns :+= encodedCol
    +          encodedCol
    +        case _ =>
    +          term
    +      }
    +    }
    +    encoderStages :+= new VectorAssembler(uid)
    +      .setInputCols(encodedTerms.toArray)
    +      .setOutputCol($(featuresCol))
    +    encoderStages :+= new ColumnPruner(tempColumns.toSet)
    +    val pipelineModel = new Pipeline(uid).setStages(encoderStages.toArray).fit(dataset)
    +    copyValues(new RFormulaModel(uid, parsedFormula.get, pipelineModel).setParent(this))
    +  }

    -  /** @group getParam */
    -  def setLabelCol(value: String): this.type = set(labelCol, value)
    +  // optimistic schema; does not contain any ML attributes
    +  override def transformSchema(schema: StructType): StructType = {
    +    if (hasLabelCol(schema)) {
    +      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true))
    +    } else {
    +      StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true) :+
    +        StructField($(labelCol), DoubleType, true))
    +    }
    +  }
    +
    +  override def copy(extra: ParamMap): RFormula = defaultCopy(extra)
    +
    +  override def toString: String = s"RFormula(${get(formula)})"
    +}
    +
    +/**
    + * A fitted RFormula. Fitting is required to determine the factor levels of formula terms.
    + * @param parsedFormula a pre-parsed R formula.
    + * @param pipelineModel the fitted feature model, including factor to index mappings.
    + */
    +private[feature] class RFormulaModel(
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125371041 test this please
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35596483

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
         val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      formula.fit(original)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      formula.fit(original)
         }
       }

       test("label column already exists") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

       test("label column already exists but is not double type") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
    +    val model = formula.fit(original)
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      model.transformSchema(original.schema)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      model.transform(original)
         }
       }

       test("allow missing label column for test datasets") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
         assert(!resultSchema.exists(_.name == "label"))
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

    -// TODO(ekl) enable after we implement string label support
    -// test("transform string label") {
    -//    val formula = new RFormula().setFormula("name ~ id")
    -//    val original = sqlContext.createDataFrame(
    -//      Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
    -//    val result = formula.transform(original)
    -//    val resultSchema = formula.transformSchema(original.schema)
    -//    val expected = sqlContext.createDataFrame(
    -//      Seq(
    -//        (1, "foo", Vectors.dense(Array(1.0)), 1.0),
    -//        (2, "bar", Vectors.dense(Array(2.0)), 0.0),
    -//        (3, "bar", Vectors.dense(Array(3.0)), 0.0))
    -//      ).toDF("id", "name", "features", "label")
    -//    assert(result.schema.toString == resultSchema.toString)
    -//    assert(result.collect().toSeq == expected.collect().toSeq)
    -// }
    +  test("encodes string terms") {
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +    val original = sqlContext.createDataFrame(
    +      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
    --- End diff --

    minor: move `).toDF(...)` to the next line for readability
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35596486

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
         val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      formula.fit(original)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      formula.fit(original)
         }
       }

       test("label column already exists") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

       test("label column already exists but is not double type") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
    +    val model = formula.fit(original)
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      model.transformSchema(original.schema)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      model.transform(original)
         }
       }

       test("allow missing label column for test datasets") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
         assert(!resultSchema.exists(_.name == "label"))
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

    -  // TODO(ekl) enable after we implement string label support
    -  // test("transform string label") {
    -  //   val formula = new RFormula().setFormula("name ~ id")
    -  //   val original = sqlContext.createDataFrame(
    -  //     Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
    -  //   val result = formula.transform(original)
    -  //   val resultSchema = formula.transformSchema(original.schema)
    -  //   val expected = sqlContext.createDataFrame(
    -  //     Seq(
    -  //       (1, "foo", Vectors.dense(Array(1.0)), 1.0),
    -  //       (2, "bar", Vectors.dense(Array(2.0)), 0.0),
    -  //       (3, "bar", Vectors.dense(Array(3.0)), 0.0))
    -  //     ).toDF("id", "name", "features", "label")
    -  //   assert(result.schema.toString == resultSchema.toString)
    -  //   assert(result.collect().toSeq == expected.collect().toSeq)
    -  // }
    +  test("encodes string terms") {
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +    val original = sqlContext.createDataFrame(
    +      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
    +    val model = formula.fit(original)
    +    val result = model.transform(original)
    +    val resultSchema = model.transformSchema(original.schema)
    +    val expected = sqlContext.createDataFrame(
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
    --- End diff --

    `Array(...)` is not necessary. `Vectors.dense` takes varargs.
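As an illustration of the point above, here is a minimal sketch of the two call styles, assuming Spark MLlib is on the classpath. The object name `DenseVarargsDemo` is hypothetical and not part of the PR; only `Vectors.dense` comes from Spark.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical demo object (not from the PR): builds the same dense vector two ways.
object DenseVarargsDemo {
  // Style used in the diff: wrap the values in an Array first.
  val viaArray = Vectors.dense(Array(0.0, 1.0, 4.0))

  // Style suggested in the review: pass the values directly as varargs.
  val viaVarargs = Vectors.dense(0.0, 1.0, 4.0)

  def main(args: Array[String]): Unit = {
    println(s"via Array:   $viaArray")
    println(s"via varargs: $viaVarargs")
    println(s"equal:       ${viaArray == viaVarargs}")
  }
}
```

Both forms should yield equal vectors, so dropping `Array(...)` is purely a readability change.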
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35596488

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
         val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      formula.fit(original)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      formula.fit(original)
         }
       }

       test("label column already exists") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

       test("label column already exists but is not double type") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
    +    val model = formula.fit(original)
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      model.transformSchema(original.schema)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      model.transform(original)
         }
       }

       test("allow missing label column for test datasets") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
         assert(!resultSchema.exists(_.name == "label"))
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

    -  // TODO(ekl) enable after we implement string label support
    -  // test("transform string label") {
    -  //   val formula = new RFormula().setFormula("name ~ id")
    -  //   val original = sqlContext.createDataFrame(
    -  //     Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
    -  //   val result = formula.transform(original)
    -  //   val resultSchema = formula.transformSchema(original.schema)
    -  //   val expected = sqlContext.createDataFrame(
    -  //     Seq(
    -  //       (1, "foo", Vectors.dense(Array(1.0)), 1.0),
    -  //       (2, "bar", Vectors.dense(Array(2.0)), 0.0),
    -  //       (3, "bar", Vectors.dense(Array(3.0)), 0.0))
    -  //     ).toDF("id", "name", "features", "label")
    -  //   assert(result.schema.toString == resultSchema.toString)
    -  //   assert(result.collect().toSeq == expected.collect().toSeq)
    -  // }
    +  test("encodes string terms") {
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +    val original = sqlContext.createDataFrame(
    +      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
    +    val model = formula.fit(original)
    +    val result = model.transform(original)
    +    val resultSchema = model.transformSchema(original.schema)
    +    val expected = sqlContext.createDataFrame(
    +      Seq(
    +        (1, "foo", 4, Vectors.dense(Array(0.0, 1.0, 4.0)), 1.0),
    +        (2, "bar", 4, Vectors.dense(Array(1.0, 0.0, 4.0)), 2.0),
    +        (3, "bar", 5, Vectors.dense(Array(1.0, 0.0, 5.0)), 3.0),
    +        (4, "baz", 5, Vectors.dense(Array(0.0, 0.0, 5.0)), 4.0))
    +      ).toDF("id", "a", "b", "features", "label")
    +    assert(result.schema.toString == resultSchema.toString)
    +    assert(result.collect().toSeq == expected.collect().toSeq)
    --- End diff --

    minor: Again, if you use `===` instead of `==`, we can remove `toSeq`.
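For background on why the test needed `toSeq` with plain `==`, here is a small standard-library-only sketch. It does not use ScalaTest, so it shows only the array-equality behaviour that the `===` suggestion works around; the object name `ArrayEqualityDemo` is hypothetical.

```scala
// Hypothetical demo object: shows how Scala compares arrays with `==`.
object ArrayEqualityDemo {
  val a: Array[Int] = Array(1, 2, 3)
  val b: Array[Int] = Array(1, 2, 3)

  // `==` on arrays is reference equality, so two distinct arrays
  // with the same contents do not compare equal.
  val referenceEqual: Boolean = a == b

  // Converting to Seq gives element-wise (structural) comparison,
  // which is what the `.toSeq` calls in the test achieve.
  val seqEqual: Boolean = a.toSeq == b.toSeq

  // `sameElements` compares contents without an explicit conversion.
  val elementsEqual: Boolean = a.sameElements(b)

  def main(args: Array[String]): Unit = {
    println(s"a == b:             $referenceEqual")
    println(s"a.toSeq == b.toSeq: $seqEqual")
    println(s"a.sameElements(b):  $elementsEqual")
  }
}
```

Since `collect()` returns an array of rows, the same reference-equality pitfall applies to the assertion in the diff. Per the review comment, ScalaTest's `===` makes the conversion unnecessary.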
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35598503

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -48,55 +49,59 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext {
         val formula = new RFormula().setFormula("y ~ x").setFeaturesCol("x")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      formula.fit(original)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      formula.fit(original)
         }
       }

       test("label column already exists") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

       test("label column already exists but is not double type") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("y")
         val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y")
    +    val model = formula.fit(original)
         intercept[IllegalArgumentException] {
    -      formula.transformSchema(original.schema)
    +      model.transformSchema(original.schema)
         }
         intercept[IllegalArgumentException] {
    -      formula.transform(original)
    +      model.transform(original)
         }
       }

       test("allow missing label column for test datasets") {
         val formula = new RFormula().setFormula("y ~ x").setLabelCol("label")
         val original = sqlContext.createDataFrame(Seq((0, 1.0), (2, 2.0))).toDF("x", "_not_y")
    -    val resultSchema = formula.transformSchema(original.schema)
    +    val model = formula.fit(original)
    +    val resultSchema = model.transformSchema(original.schema)
         assert(resultSchema.length == 3)
         assert(!resultSchema.exists(_.name == "label"))
    -    assert(resultSchema.toString == formula.transform(original).schema.toString)
    +    assert(resultSchema.toString == model.transform(original).schema.toString)
       }

    -  // TODO(ekl) enable after we implement string label support
    -  // test("transform string label") {
    -  //   val formula = new RFormula().setFormula("name ~ id")
    -  //   val original = sqlContext.createDataFrame(
    -  //     Seq((1, "foo"), (2, "bar"), (3, "bar"))).toDF("id", "name")
    -  //   val result = formula.transform(original)
    -  //   val resultSchema = formula.transformSchema(original.schema)
    -  //   val expected = sqlContext.createDataFrame(
    -  //     Seq(
    -  //       (1, "foo", Vectors.dense(Array(1.0)), 1.0),
    -  //       (2, "bar", Vectors.dense(Array(2.0)), 0.0),
    -  //       (3, "bar", Vectors.dense(Array(3.0)), 0.0))
    -  //     ).toDF("id", "name", "features", "label")
    -  //   assert(result.schema.toString == resultSchema.toString)
    -  //   assert(result.collect().toSeq == expected.collect().toSeq)
    -  // }
    +  test("encodes string terms") {
    +    val formula = new RFormula().setFormula("id ~ a + b")
    +    val original = sqlContext.createDataFrame(
    +      Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5))).toDF("id", "a", "b")
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35598479

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -62,19 +77,72 @@ class RFormula(override val uid: String)
       /** @group getParam */
       def getFormula: String = $(formula)

    -  /** @group getParam */
    -  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +  override def fit(dataset: DataFrame): RFormulaModel = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    // StringType terms and terms representing interactions need to be encoded before assembly.
    +    // TODO(ekl) add support for feature interactions
    +    var encoderStages = Seq[PipelineStage]()
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125373141 [Test build #38597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38597/consoleFull) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125376595 LGTM pending Jenkins.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125396509 [Test build #38597 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38597/console) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125396596 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125380508 [Test build #38602 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38602/console) for PR 7574 at commit [`f99131a`](https://github.com/apache/spark/commit/f99131ae1fcc5f84035cef20ad5d6231a38712d3).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125384454 Merged into master. Thanks!
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7574
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-125380557 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124916428 ptal
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35397462

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -130,9 +173,52 @@ class RFormula(override val uid: String)
           "Label column already exists and is not of type DoubleType.")
       }

    -  private def hasLabelCol(schema: StructType): Boolean = {
    -    schema.map(_.name).contains($(labelCol))
    +  private def featureTransformer(schema: StructType): Transformer = {
    +    // StringType terms and terms representing interactions need to be encoded before assembly.
    +    // TODO(ekl) add support for feature interactions
    +    var encoderStages = Seq[Transformer]()
    +    var tempColumns = Seq[String]()
    +    val encodedTerms = parsedFormula.terms.map { term =>
    +      schema(term) match {
    +        case column if column.dataType == StringType =>
    +          val encodedTerm = term + "_onehot_" + uid
    +          val indexer = factorLevels(term)
    +          val indexCol = indexer.getOrDefault(indexer.outputCol)
    +          encoderStages :+= indexer
    +          encoderStages :+= new OneHotEncoder()
    +            .setInputCol(indexCol)
    +            .setOutputCol(encodedTerm)
    +          tempColumns :+= encodedTerm
    +          tempColumns :+= indexCol
    +          encodedTerm
    +        case _ =>
    +          term
    +      }
    +    }
    +    encoderStages :+= new VectorAssembler(uid)
    +      .setInputCols(encodedTerms.toArray)
    +      .setOutputCol($(featuresCol))
    +    encoderStages :+= new ColumnPruner(tempColumns.toSet)
    +    new PipelineModel(uid, encoderStages.toArray)
    +  }
    +}
    +
    +/**
    + * Utility transformer for removing temporary columns from a DataFrame.
    + */
    +private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35397464

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -130,9 +173,52 @@ class RFormula(override val uid: String)
           "Label column already exists and is not of type DoubleType.")
       }

    -  private def hasLabelCol(schema: StructType): Boolean = {
    -    schema.map(_.name).contains($(labelCol))
    +  private def featureTransformer(schema: StructType): Transformer = {
    +    // StringType terms and terms representing interactions need to be encoded before assembly.
    +    // TODO(ekl) add support for feature interactions
    +    var encoderStages = Seq[Transformer]()
    +    var tempColumns = Seq[String]()
    +    val encodedTerms = parsedFormula.terms.map { term =>
    +      schema(term) match {
    +        case column if column.dataType == StringType =>
    +          val encodedTerm = term + "_onehot_" + uid
    +          val indexer = factorLevels(term)
    +          val indexCol = indexer.getOrDefault(indexer.outputCol)
    +          encoderStages :+= indexer
    +          encoderStages :+= new OneHotEncoder()
    +            .setInputCol(indexCol)
    +            .setOutputCol(encodedTerm)
    +          tempColumns :+= encodedTerm
    +          tempColumns :+= indexCol
    +          encodedTerm
    +        case _ =>
    +          term
    +      }
    +    }
    +    encoderStages :+= new VectorAssembler(uid)
    +      .setInputCols(encodedTerms.toArray)
    +      .setOutputCol($(featuresCol))
    +    encoderStages :+= new ColumnPruner(tempColumns.toSet)
    +    new PipelineModel(uid, encoderStages.toArray)
    +  }
    +}
    +
    +/**
    + * Utility transformer for removing temporary columns from a DataFrame.
    + */
    +private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
    +  override val uid = Identifiable.randomUID("columnPruner")
    +  override def transform(dataset: DataFrame): DataFrame = {
    +    var res: DataFrame = dataset
    +    for (column <- columnsToPrune) {
    +      res = res.drop(column)
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7574#discussion_r35397461

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -62,19 +77,60 @@ class RFormula(override val uid: String)
       /** @group getParam */
       def getFormula: String = $(formula)

    -  /** @group getParam */
    -  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
    +  override def fit(dataset: DataFrame): RFormulaModel = {
    +    require(parsedFormula.isDefined, "Must call setFormula() first.")
    +    val factorLevels = parsedFormula.get.terms.flatMap { term =>
    +      dataset.schema(term) match {
    +        case column if column.dataType == StringType =>
    +          val idxTerm = term + "_idx_" + uid
    +          val indexer = new StringIndexer().setInputCol(term).setOutputCol(idxTerm)
    +          Some(term -> indexer.fit(dataset))
    +        case _ =>
    +          None
    --- End diff --

    Done
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124346736 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124346148 [Test build #38316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38316/console) for PR 7574 at commit [`c302a2c`](https://github.com/apache/spark/commit/c302a2c40088de89feb37964f182de33279df818).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124338457 [Test build #38316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38316/consoleFull) for PR 7574 at commit [`c302a2c`](https://github.com/apache/spark/commit/c302a2c40088de89feb37964f182de33279df818).
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124769768 Merged build started.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124769744 Merged build triggered.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124770734 [Test build #38410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38410/consoleFull) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124776979 [Test build #38410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38410/console) for PR 7574 at commit [`0bf3c26`](https://github.com/apache/spark/commit/0bf3c2630d20408234bef9fe6358a4cca9952125).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase`
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124777072 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124338057 Merged build triggered.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-124338078 Merged build started.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123988222 Hmm, I guess that is pretty harmless though. Will do.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123994687 You can construct a `Pipeline` object in `RFormula.fit`, which contains all `StringIndexer`, `OneHotEncoder`, etc. Then call `Pipeline.fit` in `RFormula.fit` and get the fitted `PipelineModel`. Pass it to `RFormulaModel`. `RFormulaModel` becomes a simple wrapper over the fitted pipeline.
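A minimal sketch of the suggestion above, not the code that was merged: it builds one unfitted `Pipeline` from the schema and leaves all fitting to a single `pipeline.fit(dataset)` call. The object name, the helper `buildPipeline`, and the `_idx` / `_onehot` temporary column suffixes are illustrative assumptions.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.types.{StringType, StructType}

object FormulaPipelineSketch {
  // Hypothetical helper: returns an unfitted preprocessing pipeline for the given terms.
  def buildPipeline(schema: StructType, terms: Seq[String], featuresCol: String): Pipeline = {
    // For each term, record the column the assembler should read and any stages needed first.
    val encoded: Seq[(String, Seq[PipelineStage])] = terms.map { term =>
      if (schema(term).dataType == StringType) {
        val indexCol = term + "_idx"       // illustrative temporary column name
        val encodedCol = term + "_onehot"  // illustrative temporary column name
        val stages: Seq[PipelineStage] = Seq(
          new StringIndexer().setInputCol(term).setOutputCol(indexCol),
          new OneHotEncoder().setInputCol(indexCol).setOutputCol(encodedCol))
        (encodedCol, stages)
      } else {
        // Non-string terms are passed to the assembler unchanged.
        (term, Seq.empty[PipelineStage])
      }
    }
    val assembler = new VectorAssembler()
      .setInputCols(encoded.map(_._1).toArray)
      .setOutputCol(featuresCol)
    val stages: Seq[PipelineStage] = encoded.flatMap(_._2) :+ assembler
    new Pipeline().setStages(stages.toArray)
  }
}
```

Calling `FormulaPipelineSketch.buildPipeline(dataset.schema, terms, "features").fit(dataset)` would then yield the `PipelineModel` that a wrapper such as `RFormulaModel` could hold.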
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user ericl commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123961633 @mengxr to clarify, not calling `StringIndexer.fit` in `RFormula.fit` means RFormulaModel will have a reference to the original fitted dataset, correct?
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7574#discussion_r35279252
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -62,19 +77,60 @@ class RFormula(override val uid: String)
   /** @group getParam */
   def getFormula: String = $(formula)

-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+  override def fit(dataset: DataFrame): RFormulaModel = {
+    require(parsedFormula.isDefined, "Must call setFormula() first.")
+    val factorLevels = parsedFormula.get.terms.flatMap { term =>
+      dataset.schema(term) match {
+        case column if column.dataType == StringType =>
+          val idxTerm = term + "_idx_" + uid
+          val indexer = new StringIndexer().setInputCol(term).setOutputCol(idxTerm)
+          Some(term -> indexer.fit(dataset))
+        case _ =>
+          None
--- End diff --
It might be simpler to construct the entire preprocessing pipeline in `fit`, which includes `StringIndexer`s, `OneHotEncoder`, and `VectorAssembler`. Then call `fit` on the pipeline and pass the `PipelineModel` to `RFormulaModel`. We might add `StringVectorizer` to combine `StringIndexer` and `OneHotEncoder` in the future. I'm a little worried about the generated feature names. But we could address this issue separately.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7574#discussion_r35279311
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -130,9 +173,52 @@ class RFormula(override val uid: String)
         "Label column already exists and is not of type DoubleType.")
   }

-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+  private def featureTransformer(schema: StructType): Transformer = {
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[Transformer]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.terms.map { term =>
+      schema(term) match {
+        case column if column.dataType == StringType =>
+          val encodedTerm = term + "_onehot_" + uid
+          val indexer = factorLevels(term)
+          val indexCol = indexer.getOrDefault(indexer.outputCol)
+          encoderStages :+= indexer
+          encoderStages :+= new OneHotEncoder()
+            .setInputCol(indexCol)
+            .setOutputCol(encodedTerm)
+          tempColumns :+= encodedTerm
+          tempColumns :+= indexCol
+          encodedTerm
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    new PipelineModel(uid, encoderStages.toArray)
+  }
+}
+
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
--- End diff --
Leave a TODO note to make this a public transformer.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123913341 @ericl I think it is simpler to construct a `pipeline` in `RFormula.fit` without calling `StringIndexer.fit` explicitly. That leaves space for `pipeline.fit` optimization. Then `RFormulaModel` takes the `PipelineModel` object directly, which does most of the job.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7574#discussion_r35279570
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
@@ -130,9 +173,52 @@ class RFormula(override val uid: String)
         "Label column already exists and is not of type DoubleType.")
   }

-  private def hasLabelCol(schema: StructType): Boolean = {
-    schema.map(_.name).contains($(labelCol))
+  private def featureTransformer(schema: StructType): Transformer = {
+    // StringType terms and terms representing interactions need to be encoded before assembly.
+    // TODO(ekl) add support for feature interactions
+    var encoderStages = Seq[Transformer]()
+    var tempColumns = Seq[String]()
+    val encodedTerms = parsedFormula.terms.map { term =>
+      schema(term) match {
+        case column if column.dataType == StringType =>
+          val encodedTerm = term + "_onehot_" + uid
+          val indexer = factorLevels(term)
+          val indexCol = indexer.getOrDefault(indexer.outputCol)
+          encoderStages :+= indexer
+          encoderStages :+= new OneHotEncoder()
+            .setInputCol(indexCol)
+            .setOutputCol(encodedTerm)
+          tempColumns :+= encodedTerm
+          tempColumns :+= indexCol
+          encodedTerm
+        case _ =>
+          term
+      }
+    }
+    encoderStages :+= new VectorAssembler(uid)
+      .setInputCols(encodedTerms.toArray)
+      .setOutputCol($(featuresCol))
+    encoderStages :+= new ColumnPruner(tempColumns.toSet)
+    new PipelineModel(uid, encoderStages.toArray)
+  }
+}
+
+/**
+ * Utility transformer for removing temporary columns from a DataFrame.
+ */
+private class ColumnPruner(columnsToPrune: Set[String]) extends Transformer {
+  override val uid = Identifiable.randomUID("columnPruner")
+  override def transform(dataset: DataFrame): DataFrame = {
+    var res: DataFrame = dataset
+    for (column <- columnsToPrune) {
+      res = res.drop(column)
--- End diff --
Calling `drop` one by one might increase the stack size. We can get output columns by `dataset.columns.toSet -- columnsToPrune` and then call `select` directly.
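The single-`select` idea in the review comment above can be sketched in plain Scala. This is an illustration under assumptions, not the merged code: the object and function names are hypothetical, and it swaps the comment's `toSet -- columnsToPrune` for `filterNot`, because a `Set` does not keep the original column order.

```scala
object ColumnPruning {
  // Returns the columns that survive pruning, in their original order.
  def columnsToKeep(all: Seq[String], toPrune: Set[String]): Seq[String] =
    all.filterNot(toPrune.contains)
}
```

With a DataFrame `dataset`, the result could be applied in one step, for example `dataset.select(ColumnPruning.columnsToKeep(dataset.columns, columnsToPrune).map(dataset.col): _*)`, instead of chaining one `drop` per column.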
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123488315 [Test build #37982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37982/console) for PR 7574 at commit [`4d79193`](https://github.com/apache/spark/commit/4d79193d478aeca8fae0f31c15808d6dccb40718).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123488427 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123475216 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123475106 [Test build #37977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37977/console) for PR 7574 at commit [`169a085`](https://github.com/apache/spark/commit/169a0850fc40964194e48c4b317b74226a542cd5).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123479213 Merged build started.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123479155 Merged build triggered.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123467145 [Test build #37977 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37977/consoleFull) for PR 7574 at commit [`169a085`](https://github.com/apache/spark/commit/169a0850fc40964194e48c4b317b74226a542cd5).
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123465703 Merged build started.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123465662 Merged build triggered.
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7574#issuecomment-123480884 [Test build #37982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37982/consoleFull) for PR 7574 at commit [`4d79193`](https://github.com/apache/spark/commit/4d79193d478aeca8fae0f31c15808d6dccb40718).
[GitHub] spark pull request: [SPARK-9230] [ML] Support StringType features ...
GitHub user ericl opened a pull request: https://github.com/apache/spark/pull/7574

[SPARK-9230] [ML] Support StringType features in RFormula

This adds StringType feature support via OneHotEncoder. As part of this task it was necessary to change RFormula to an Estimator, so that factor levels could be determined from the training dataset. Not sure if I am using uids correctly here, would be good to get reviewer help on that. cc @mengxr

Umbrella design doc: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ericl/spark string-features

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7574.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #7574

commit a1d03f44f7e226198bde129cc0f40827761bff17
Author: Eric Liang e...@databricks.com
Date: 2015-07-20T22:25:55Z
refactor into estimator

commit 8a637db882175161ef17dce0795cf1576b594f20
Author: Eric Liang e...@databricks.com
Date: 2015-07-20T23:40:20Z
encoder wip

commit b01c7c5c90efac1d3470b2c463fddf91fbf67408
Author: Eric Liang e...@databricks.com
Date: 2015-07-21T00:53:11Z
add test

commit 5b2c4a2d8c29065a232aa207deaa6e869e545131
Author: Eric Liang e...@databricks.com
Date: 2015-07-21T01:45:33Z
Mon Jul 20 18:45:33 PDT 2015

commit d841cec4f42cef5dbda3d43e036964ae63fd71c9
Author: Eric Liang e...@databricks.com
Date: 2015-07-21T17:49:29Z
Merge branch 'master' into string-features
Conflicts:
mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala

commit 72bd6f333dd118a900338917213bb8e75144c6e7
Author: Eric Liang e...@databricks.com
Date: 2015-07-21T19:22:57Z
fix merge

commit a230a4790c5163d337781fb9f50cca8a7f83a8b1
Author: Eric Liang e...@databricks.com
Date: 2015-07-21T19:49:03Z
Merge branch 'master' into string-features

commit 169a0850fc40964194e48c4b317b74226a542cd5
Author: Eric Liang e...@databricks.com
Date: 2015-07-21T20:08:48Z
tweak functional test
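As a rough illustration of what the description above means by making `RFormula` an Estimator, the sketch below fits a formula on a training DataFrame and then transforms it. The object name and the column names `clicked`, `country`, and `hour` are hypothetical; only `setFormula`, `fit`, and `transform` are taken to be the public API.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.DataFrame

object RFormulaUsageSketch {
  // Hypothetical columns: "clicked" (numeric label), "country" (string), "hour" (numeric).
  val formula: RFormula = new RFormula().setFormula("clicked ~ country + hour")

  // fit() looks at the training data, which is where string levels can be learned;
  // the fitted model's transform() then appends the assembled output columns.
  def encode(training: DataFrame): DataFrame =
    formula.fit(training).transform(training)
}
```

Because the factor levels of a string column such as `country` are only known after seeing the data, a plain Transformer would not be enough, which matches the motivation given in the description.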