[GitHub] spark pull request #20229: [SPARK-23037][ML] Update RFormula to use VectorSi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r160878813 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -199,6 +199,7 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) val parsedFormula = RFormulaParser.parse($(formula)) val resolvedFormula = parsedFormula.resolve(dataset.schema) val encoderStages = ArrayBuffer[PipelineStage]() +val oneHotEncodeStages = ArrayBuffer[(String, String)]() --- End diff -- `oneHotEncodeColumns` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20229: [SPARK-23037][ML] Update RFormula to use VectorSi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r160878716 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -228,22 +229,33 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) // Then we handle one-hot encoding and interactions between terms. var keepReferenceCategory = false val encodedTerms = resolvedFormula.terms.map { - case Seq(term) if dataset.schema(term).dataType == StringType => -val encodedCol = tmpColumn("onehot") -var encoder = new OneHotEncoder() - .setInputCol(indexed(term)) - .setOutputCol(encodedCol) -// Formula w/o intercept, one of the categories in the first category feature is -// being used as reference category, we will not drop any category for that feature. -if (!hasIntercept && !keepReferenceCategory) { - encoder = encoder.setDropLast(false) - keepReferenceCategory = true -} -encoderStages += encoder -prefixesToRewrite(encodedCol + "_") = term + "_" -encodedCol case Seq(term) => -term +dataset.schema(term).dataType match { + case _: StringType => +val encodedCol = tmpColumn("onehot") +// Formula w/o intercept, one of the categories in the first category feature is +// being used as reference category, we will not drop any category for that feature. +if (!hasIntercept && !keepReferenceCategory) { + keepReferenceCategory = true + encoderStages += new OneHotEncoderEstimator(uid) --- End diff -- Oh, I see. There is one `OneHotEncoderEstimator` for `dropLast=false`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20229: [SPARK-23037][ML] Update RFormula to use VectorSi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r160877856 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -228,22 +229,33 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) // Then we handle one-hot encoding and interactions between terms. var keepReferenceCategory = false val encodedTerms = resolvedFormula.terms.map { - case Seq(term) if dataset.schema(term).dataType == StringType => -val encodedCol = tmpColumn("onehot") -var encoder = new OneHotEncoder() - .setInputCol(indexed(term)) - .setOutputCol(encodedCol) -// Formula w/o intercept, one of the categories in the first category feature is -// being used as reference category, we will not drop any category for that feature. -if (!hasIntercept && !keepReferenceCategory) { - encoder = encoder.setDropLast(false) - keepReferenceCategory = true -} -encoderStages += encoder -prefixesToRewrite(encodedCol + "_") = term + "_" -encodedCol case Seq(term) => -term +dataset.schema(term).dataType match { + case _: StringType => +val encodedCol = tmpColumn("onehot") +// Formula w/o intercept, one of the categories in the first category feature is +// being used as reference category, we will not drop any category for that feature. +if (!hasIntercept && !keepReferenceCategory) { + keepReferenceCategory = true + encoderStages += new OneHotEncoderEstimator(uid) --- End diff -- As `OneHotEncoderEstimator` can handle multi-column now, can't we just have one `OneHotEncoderEstimator` to fit for all terms at once? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org