[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940 ]

yuhao yang commented on SPARK-14760:

Closing this since it has been overlooked for some time. Thanks for the review and comments.

> Feature transformers should always invoke transformSchema in transform or fit
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter
> validation, transformers should always invoke transformSchema in transform
> and fit.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251391#comment-15251391 ]

Nick Pentreath commented on SPARK-14760:

In general, given the name {{transformSchema}}, one would expect the method to actually transform the input schema into the output schema. This is the case, but only a few transformers actually seem to use the output schema returned from {{transformSchema}}. Hence, the output schema declared by {{transformSchema}} is not actually enforced in {{fit}} or {{transform}}.

So in a Pipeline, you can call {{transformSchema}} for each stage, which performs validation up front, but if the individual transformers don't enforce the output schema returned, you can have a situation where the schema validation succeeds but a pipeline stage does something different and breaks it.

IMO the approach used by those examples ({{HashingTF}}, {{Binarizer}}) is correct and other transformers should do the same, no?
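To make the Pipeline concern concrete, here is a minimal sketch in plain Python rather than Spark. Everything in it is a hypothetical stand-in: a schema is a dict from column name to type name, a dataset is a list of row dicts, and `Tokenize`, `DriftingCount`, `validate_pipeline` and `run_checked` are invented for illustration. It only shows how up-front validation can pass while a stage later produces something other than what it declared, and how comparing against the declared schema would catch that.

```python
# Toy model only: none of these classes or functions are Spark APIs.

class Tokenize:
    """Hypothetical stage that honours the schema it declares."""

    def transform_schema(self, schema):
        if schema.get("text") != "string":
            raise ValueError("Tokenize: column 'text' must be of type string")
        return {**schema, "words": "array<string>"}

    def transform(self, rows):
        return [{**r, "words": r["text"].split()} for r in rows]


class DriftingCount:
    """Hypothetical stage whose transform ignores its own declared output."""

    def transform_schema(self, schema):
        if schema.get("words") != "array<string>":
            raise ValueError("DriftingCount: column 'words' must be array<string>")
        return {**schema, "count": "int"}

    def transform(self, rows):
        # Declares 'count' above but actually writes 'n_words'.
        return [{**r, "n_words": len(r["words"])} for r in rows]


def validate_pipeline(stages, schema):
    """Up-front check: thread the schema through every stage."""
    for stage in stages:
        schema = stage.transform_schema(schema)
    return schema


def run_checked(stages, rows, schema):
    """Run each stage and compare the produced columns with the declared ones."""
    for stage in stages:
        schema = stage.transform_schema(schema)
        rows = stage.transform(rows)
        produced = set(rows[0]) if rows else set(schema)
        if produced != set(schema):
            raise RuntimeError(
                f"{type(stage).__name__}: declared columns {sorted(schema)} "
                f"but produced {sorted(produced)}"
            )
    return rows


stages = [Tokenize(), DriftingCount()]
input_schema = {"text": "string"}
rows = [{"text": "a b c"}]

# The up-front validation passes even though DriftingCount misbehaves at run time.
declared = validate_pipeline(stages, input_schema)
```

In this toy, `validate_pipeline` succeeds, while `run_checked(stages, rows, input_schema)` raises a `RuntimeError` naming `DriftingCount`, because the stage's output is checked against what it declared.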
[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251372#comment-15251372 ]

Nick Pentreath commented on SPARK-14760:

It seems it is there for validation now, but then the name is a bit misleading.
[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251324#comment-15251324 ]

yuhao yang commented on SPARK-14760:

[~holdenkarau] also shared some thoughts in the PR. In my opinion, the design needs to cover:

1. Pipeline scenario: We can use transformSchema to conduct validation before the actual transform happens. This is especially helpful and efficient in a pipeline.
2. A transformer is used independently: transformSchema mainly provides the validation function in this case.

And actually, some transformers are using transformSchema to get the output schema in transform, such as HashingTF, Binarizer and ChiSqSelectorModel.

From a design perspective, transformSchema should cover validation (including friendly error handling) and schema transformation, so that transform/fit can trust that the dataset meets certain assumptions. That's why this JIRA was created.
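The standalone case (point 2) can be sketched as follows. This is a plain-Python toy, not Spark code: `ToyBinarizer`, its parameters, and the dict-based schema are all assumptions made for illustration. It shows `transform` invoking `transform_schema` first, so that bad input fails with a readable message and the rest of `transform` can rely on the checked assumptions.

```python
# Toy model only: not Spark's API. A schema is a dict of column name -> type name.

class ToyBinarizer:
    """Hypothetical Binarizer-like transformer over a list of row dicts."""

    def __init__(self, input_col, output_col, threshold=0.0):
        self.input_col = input_col
        self.output_col = output_col
        self.threshold = threshold

    def transform_schema(self, schema):
        # Parameter and input validation with readable error messages.
        if self.input_col not in schema:
            raise ValueError(f"Input column '{self.input_col}' does not exist")
        if schema[self.input_col] != "double":
            raise ValueError(
                f"Input column '{self.input_col}' must be double "
                f"but was {schema[self.input_col]}"
            )
        if self.output_col in schema:
            raise ValueError(f"Output column '{self.output_col}' already exists")
        return {**schema, self.output_col: "double"}

    def transform(self, rows, schema):
        # Validate first, so the code below can rely on the checked assumptions.
        output_schema = self.transform_schema(schema)
        out = [
            {**r, self.output_col: 1.0 if r[self.input_col] > self.threshold else 0.0}
            for r in rows
        ]
        return out, output_schema


binarizer = ToyBinarizer("x", "y", threshold=0.5)
result, result_schema = binarizer.transform([{"x": 0.2}, {"x": 0.9}], {"x": "double"})
```

Without the call to `transform_schema`, passing a string column to this toy would fail inside the comparison with a generic `TypeError` instead of a message that names the offending column.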
[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250887#comment-15250887 ]

Joseph K. Bradley commented on SPARK-14760:

transformSchema is there for schema validation (see [SPARK-14608]).

It'd be worth discussing whether an individual transformer needs to invoke schema validation before fitting or transforming. I'd say that, in general, it is a judgement call depending on whether transformSchema is expensive or cheap, whether it throws a better error than fitting without a check, etc.
[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250475#comment-15250475 ]

Nick Pentreath commented on SPARK-14760:

I've noticed that most (in fact pretty much all) transformers and models don't actually use the output of {{transformSchema}}. That is, they call it predominantly for parameter and input schema validation, and often do something like {{SchemaUtils.appendColumn(schema, ...)}} at the end of {{transformSchema}}. But that returned output schema is never actually used to generate the output DataFrame; almost invariably it's a bunch of selects and transforms on the original DataFrame. This just happens to work because of the input schema validation and the operations performed on the input DF.

I ran into this when trying to append a nullable column to a predictor. Putting something in {{transformSchema}} by itself does nothing unless the result is actually used (and then using the schema is clunky: you need to convert to `RDD[Row]` and re-create the DF).

Is this just an oversight? [~josephkb]
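The nullable-column case can be illustrated with a small plain-Python sketch. It is not Spark code: `Field`, `ToyFrame`, `with_column` and `ToyPredictor` are hypothetical, and the rule that derived columns are always non-nullable is an assumption of the toy, chosen only to reproduce the mismatch. It contrasts a transform that discards the declared schema with one that re-creates the frame from it.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Toy model only: ToyFrame and Field are hypothetical stand-ins, not Spark classes.


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    nullable: bool


@dataclass(frozen=True)
class ToyFrame:
    schema: Tuple[Field, ...]
    rows: Tuple[tuple, ...]

    def with_column(self, name: str, dtype: str, fn: Callable[[tuple], object]) -> "ToyFrame":
        # Assumption of this toy: derived columns are always non-nullable.
        schema = self.schema + (Field(name, dtype, nullable=False),)
        rows = tuple(r + (fn(r),) for r in self.rows)
        return ToyFrame(schema, rows)


class ToyPredictor:
    def transform_schema(self, schema):
        # Declares a nullable prediction column.
        return schema + (Field("prediction", "double", nullable=True),)

    def transform_ignoring_schema(self, frame):
        self.transform_schema(frame.schema)  # validation only; result discarded
        return frame.with_column("prediction", "double", lambda r: r[0] * 2.0)

    def transform_using_schema(self, frame):
        output_schema = self.transform_schema(frame.schema)
        computed = frame.with_column("prediction", "double", lambda r: r[0] * 2.0)
        # Re-create the frame so that it carries the declared schema.
        return ToyFrame(output_schema, computed.rows)


frame = ToyFrame((Field("x", "double", nullable=False),), ((1.0,), (2.5,)))
predictor = ToyPredictor()
ignored = predictor.transform_ignoring_schema(frame)
used = predictor.transform_using_schema(frame)
```

Both variants compute the same rows, but only `used` carries the nullable flag that `transform_schema` declared. The extra rebuild step in `transform_using_schema` is the toy's analogue of the clunky re-creation described above.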
[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit
[ https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250208#comment-15250208 ]

Apache Spark commented on SPARK-14760:

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/12533