[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892189#comment-15892189
 ] 

Zhe Sun edited comment on SPARK-19797 at 3/2/17 12:52 PM:
----------------------------------------------------------

Hi Sean, thanks for your quick reply. 

bq. If the Pipeline had more stages, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.

Let's use IDF as an example. Suppose the pipeline is:
bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression
When we fit this pipeline, the Pipeline first calls _fit_ on *IDF*, then calls 
_transform_ on the fitted IDF model and passes the result to LogisticRegression. 
This is because LogisticRegression is an Estimator, and its _fit_ needs the data 
produced by the _transform_ of *IDF*.

However, suppose the last stage of the pipeline is a Normalizer 
(https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer)
bq. Tokenizer -> HashingTF -> IDF -> Normalizer 
When fitting this pipeline, only _fit_ is called on *IDF*; _transform_ does not 
need to be called, because Normalizer is a Transformer and is not fitted.

That's why I think the description should be changed as follows to make it 
accurate:
bq. If the Pipeline had more Estimators, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.
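
To make the difference concrete, here is a minimal sketch of the fitting logic as I understand it from the linked Pipeline.scala. It is a simplified model, not the real API: the Transformer and Estimator classes, fit_pipeline, and calls_when_fitting are illustrative stand-ins, and the "data" is just a list. It records which methods get called when each of the two example pipelines is fitted.

```python
# Simplified model of Pipeline.fit(); these classes are illustrative stand-ins,
# not the real pyspark.ml API.

class Transformer:
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def transform(self, data):
        self.log.append(f"{self.name}.transform")
        return data + [self.name]


class Estimator:
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def fit(self, data):
        self.log.append(f"{self.name}.fit")
        # Fitting yields a model, which is itself a Transformer.
        return Transformer(f"{self.name}Model", self.log)


def fit_pipeline(stages, data):
    """Fit stages in order; transform only while a later Estimator needs the data."""
    last_estimator = max(
        (i for i, s in enumerate(stages) if isinstance(s, Estimator)), default=-1
    )
    fitted = []
    for i, stage in enumerate(stages):
        if i <= last_estimator:
            model = stage.fit(data) if isinstance(stage, Estimator) else stage
            # The extra transform happens only if another Estimator follows.
            if i < last_estimator:
                data = model.transform(data)
        else:
            # Stages after the last Estimator are kept as-is during fitting.
            model = stage
        fitted.append(model)
    return fitted


def calls_when_fitting(last_stage_cls, last_stage_name):
    """Fit Tokenizer -> HashingTF -> IDF -> <last stage> and return the call log."""
    log = []
    stages = [
        Transformer("Tokenizer", log),
        Transformer("HashingTF", log),
        Estimator("IDF", log),
        last_stage_cls(last_stage_name, log),
    ]
    fit_pipeline(stages, [])
    return log


with_lr = calls_when_fitting(Estimator, "LogisticRegression")
with_normalizer = calls_when_fitting(Transformer, "Normalizer")
print(with_lr)
print(with_normalizer)
```

In this sketch, the log for the LogisticRegression pipeline contains IDFModel.transform, while the log for the Normalizer pipeline stops at IDF.fit. What triggers the extra transform call is a following Estimator, not merely a following stage.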



> ML pipelines document error
> ---------------------------
>
>                 Key: SPARK-19797
>                 URL: https://issues.apache.org/jira/browse/SPARK-19797
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Zhe Sun
>            Priority: Trivial
>              Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The description of the pipeline in this paragraph is incorrect and may mislead 
> users: https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because a *Transformer* can also be a stage, 
> but only a subsequent Estimator triggers the extra transform call.
> So the description should be corrected to: *If the Pipeline had more 
> _Estimators_*. 
> The code that shows this is here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160


