[
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250475#comment-15250475
]
Nick Pentreath edited comment on SPARK-14760 at 4/20/16 6:38 PM:
I've noticed that most (in fact pretty much all) transformers and models don't
actually use the output of {{transformSchema}}. They call it mainly to validate
parameters and the input schema, and often do something like
{{SchemaUtils.appendColumn(schema, ...)}} at the end of {{transformSchema}}. But
the returned output schema is never actually used to generate the output
DataFrame; almost invariably the output is built from a bunch of selects and
transforms on the original DataFrame. This just happens to work because of the
input schema validation and the operations performed on the input DF.

I ran into this when trying to append a nullable column to a predictor. Putting
something in {{transformSchema}} by itself does nothing unless the result is
actually used. Even then, using the schema is clunky: you need to convert to
{{RDD[Row]}} and re-create the DF.

Is this just an oversight? [~josephkb]
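
The mismatch described above can be illustrated without Spark. What follows is a minimal sketch in plain Python, not Spark code: {{Field}}, {{Frame}}, {{ToyPredictor}} and all of their methods are hypothetical stand-ins, and the rule that the toy engine always infers a non-nullable column is an assumption made up for the illustration. It shows a declared output schema with a nullable prediction column that {{transform}} validates against but never applies, plus a workaround that rebuilds the result under the declared schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field."""
    name: str
    dtype: str
    nullable: bool


@dataclass
class Frame:
    """Hypothetical stand-in for a DataFrame: a schema plus rows as dicts."""
    schema: List[Field]
    rows: List[Dict[str, float]]

    def with_column(self, name: str, fn: Callable[[Dict[str, float]], float]) -> "Frame":
        # Assumption for this toy: the engine infers the new field on its own
        # and always marks it non-nullable, whatever the caller declared.
        new_rows = [{**row, name: fn(row)} for row in self.rows]
        inferred = Field(name, "double", nullable=False)
        return Frame(self.schema + [inferred], new_rows)


class ToyPredictor:
    """Hypothetical predictor that doubles its input column."""

    def __init__(self, input_col: str = "features", output_col: str = "prediction"):
        self.input_col = input_col
        self.output_col = output_col

    def transform_schema(self, schema: List[Field]) -> List[Field]:
        # Validate the input schema, then declare the output schema.
        names = [f.name for f in schema]
        if self.input_col not in names:
            raise ValueError(f"missing input column {self.input_col!r}")
        if self.output_col in names:
            raise ValueError(f"output column {self.output_col!r} already exists")
        return schema + [Field(self.output_col, "double", nullable=True)]

    def transform(self, frame: Frame) -> Frame:
        # The common pattern: call transform_schema for validation only and
        # discard the schema it returns.
        self.transform_schema(frame.schema)
        return frame.with_column(self.output_col, lambda row: 2.0 * row[self.input_col])

    def transform_with_declared_schema(self, frame: Frame) -> Frame:
        # The clunky alternative: compute the rows, then re-create the frame
        # under the declared schema.
        declared = self.transform_schema(frame.schema)
        return Frame(declared, self.transform(frame).rows)
```

In this sketch, {{transform}} yields a non-nullable prediction field even though {{transform_schema}} declares it nullable; only {{transform_with_declared_schema}} honours the declaration, at the cost of rebuilding the frame.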
> Feature transformers should always invoke transformSchema in transform or fit
> -
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter
> validation, transformers should always invoke transformSchema in transform
> and fit.