[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940
 ] 

yuhao yang commented on SPARK-14760:


Closing this since it has been overlooked for some time. Thanks for the reviews
and comments.

> Feature transformers should always invoke transformSchema in transform or fit
> -----------------------------------------------------------------------------
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter
> validation, transformers should always invoke transformSchema in transform
> and fit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2016-04-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251391#comment-15251391
 ] 

Nick Pentreath commented on SPARK-14760:


In general, given the name {{transformSchema}}, one would expect the method to
actually transform the input schema into the output schema. It does, but only a
few transformers seem to use the output schema returned from
{{transformSchema}}. Hence, the output schema declared in {{transformSchema}}
is not actually enforced in {{fit}} or {{transform}}.

So in a Pipeline, you can call {{transformSchema}} for each stage, which
performs validation upfront, but if the individual transformers don't enforce
the output schema returned, schema validation can succeed while a pipeline
stage does something different and breaks it.

IMO the approach used by those examples ({{HashingTF}}, {{Binarizer}}) is
correct, and other transformers should do the same, no?
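
A minimal sketch of the pattern referred to above, assuming the Spark 2.x ML API
(where {{transform}} takes a {{Dataset[_]}}). {{PositiveFlag}} and its column
names are hypothetical and only illustrate the idea; this is not the actual
{{HashingTF}} or {{Binarizer}} source. {{transform}} calls {{transformSchema}}
first, so validation always runs, and then takes the output column's name and
metadata from the returned schema rather than discarding it.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical transformer for illustration; it is not part of Spark.
class PositiveFlag(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("positiveFlag"))

  val inputCol = new Param[String](this, "inputCol", "input column name")
  val outputCol = new Param[String](this, "outputCol", "output column name")
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Validates the input schema, then declares the output schema.
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains($(inputCol)),
      s"Input column ${$(inputCol)} does not exist.")
    require(schema($(inputCol)).dataType == DoubleType,
      s"Input column ${$(inputCol)} must be DoubleType.")
    require(!schema.fieldNames.contains($(outputCol)),
      s"Output column ${$(outputCol)} already exists.")
    StructType(schema.fields :+ StructField($(outputCol), DoubleType, nullable = false))
  }

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Validate first, and keep the declared output schema instead of dropping it.
    val outputSchema = transformSchema(dataset.schema, logging = true)
    val declared = outputSchema($(outputCol))
    val flag = udf { x: Double => if (x > 0.0) 1.0 else 0.0 }
    // The output column takes its name and metadata from the declared field.
    dataset.select(col("*"), flag(col($(inputCol))).as(declared.name, declared.metadata))
  }

  override def copy(extra: ParamMap): PositiveFlag = defaultCopy(extra)
}

object PositiveFlagExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("positive-flag").getOrCreate()
    import spark.implicits._
    val df = Seq(-1.0, 2.0).toDF("x")
    new PositiveFlag().setInputCol("x").setOutputCol("flag").transform(df).show()
    spark.stop()
  }
}
```

Note that this only carries the declared name and metadata over. The
nullability of the resulting column is still decided by the select expression,
not by the declared {{StructField}}.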




[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2016-04-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251372#comment-15251372
 ] 

Nick Pentreath commented on SPARK-14760:


It seems it is there for validation now, but then the name is a bit misleading.




[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2016-04-21 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251324#comment-15251324
 ] 

yuhao yang commented on SPARK-14760:


[~holdenkarau] also shared some thoughts in the PR.

In my opinion, the design needs to cover two scenarios:
1. Pipeline: we can use transformSchema to conduct validation before the
actual transform happens. This is especially helpful and efficient in a
pipeline.
2. A transformer used independently: transformSchema mainly provides
validation in this case. In fact, some transformers already use
transformSchema to get the output schema in transform, such as HashingTF,
Binarizer, and ChiSqSelectorModel.

From a design perspective, transformSchema should cover both validation
(including friendly error handling) and schema transformation, so that
transform/fit can trust that the dataset meets certain assumptions. That is
why this JIRA was created.
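
The pipeline scenario above can be sketched as follows, assuming Spark 2.x or
later. The column names and the choice of stages are made up for illustration.
{{Pipeline.transformSchema}} threads the schema through every stage, so a
mis-wired stage is rejected before any data is touched, and no SparkSession is
needed for the check.

```scala
import scala.util.Try

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{Binarizer, HashingTF, Tokenizer}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object PipelineSchemaCheck {
  // Hypothetical input schema: a text column and a numeric score.
  val inputSchema: StructType = StructType(Seq(
    StructField("text", StringType),
    StructField("score", DoubleType)))

  // Stages are wired correctly: Tokenizer feeds HashingTF.
  def goodPipeline: Pipeline = new Pipeline().setStages(Array[PipelineStage](
    new Tokenizer().setInputCol("text").setOutputCol("words"),
    new HashingTF().setInputCol("words").setOutputCol("features"),
    new Binarizer().setInputCol("score").setOutputCol("label").setThreshold(0.5)))

  // Mis-wired on purpose: HashingTF reads the raw string column.
  def badPipeline: Pipeline = new Pipeline().setStages(Array[PipelineStage](
    new HashingTF().setInputCol("text").setOutputCol("features")))

  def main(args: Array[String]): Unit = {
    // Schema-only validation; no DataFrame is created or processed here.
    val outputSchema = goodPipeline.transformSchema(inputSchema)
    println(outputSchema.fieldNames.mkString(", "))

    // The bad pipeline should be rejected during schema validation.
    val rejected = Try(badPipeline.transformSchema(inputSchema)).isFailure
    println(s"bad pipeline rejected: $rejected")
  }
}
```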





[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2016-04-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250887#comment-15250887
 ] 

Joseph K. Bradley commented on SPARK-14760:

transformSchema is there for schema validation (see [SPARK-14608]). It'd be
worth discussing whether an individual transformer needs to invoke schema
validation before fitting or transforming. I'd say that, in general, it is a
judgement call depending on whether transformSchema is expensive or cheap,
whether it throws a better error than fitting without a check, etc.




[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2016-04-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250475#comment-15250475
 ] 

Nick Pentreath commented on SPARK-14760:


I've noticed that most (in fact pretty much all) transformers and models don't
actually use the output of {{transformSchema}}. That is, they call it
predominantly for parameter and input schema validation, and often do something
like {{SchemaUtils.appendColumn(schema, ...)}} at the end of
{{transformSchema}}. But that returned output schema is never actually used to
generate the output DataFrame. Almost invariably the output is built from a
bunch of selects and transforms on the original DataFrame. This just happens to
work because of the input schema validation and the operations performed on the
input DF.

I ran into this when trying to append a nullable column to a predictor. Putting
something in {{transformSchema}} by itself does nothing unless the result is
actually used (and then using the schema is clunky: you need to convert to
{{RDD[Row]}} and re-create the DF).

Is this just an oversight? [~josephkb]
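
For reference, the clunky route mentioned above looks roughly like this. It is
a sketch assuming Spark 2.x, with a made-up {{prediction}} column and a
hypothetical helper name, not code from any Spark transformer. The DataFrame
is converted to {{RDD[Row]}} and rebuilt with the declared schema, which here
marks the appended column as nullable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructType

object EnforceSchema {
  // Hypothetical helper: rebuild `df` so that its schema is exactly `declared`.
  // Goes through RDD[Row], so the field order and types must already match.
  def withDeclaredSchema(df: DataFrame, declared: StructType): DataFrame =
    df.sparkSession.createDataFrame(df.rdd, declared)

  // Same fields as `schema`, with the named column marked as nullable.
  def markNullable(schema: StructType, name: String): StructType =
    StructType(schema.map(f => if (f.name == name) f.copy(nullable = true) else f))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("enforce-schema").getOrCreate()
    import spark.implicits._

    // A literal column is non-nullable as far as the select expression is concerned.
    val df = Seq(1.0, 2.0).toDF("x").withColumn("prediction", lit(0.0))
    df.printSchema()

    // Declaring the column nullable only takes effect once the DF is re-created.
    val enforced = withDeclaredSchema(df, markNullable(df.schema, "prediction"))
    enforced.printSchema()

    spark.stop()
  }
}
```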




[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2016-04-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250208#comment-15250208
 ] 

Apache Spark commented on SPARK-14760:

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/12533
