[jira] [Comment Edited] (SPARK-19634) Feature parity for descriptive statistics in MLlib

Seth Hendrickson (JIRA) Mon, 27 Mar 2017 15:25:03 -0700

    [ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944030#comment-15944030 ]


Seth Hendrickson edited comment on SPARK-19634 at 3/27/17 10:23 PM:
--------------------------------------------------------------------

I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces: an RDD one that provides the same performance as the current 
{{MultivariateOnlineSummarizer}}, and a DataFrame interface using a UDAF. 

From the design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that the 
approach was pivoted away from UDAF, but the design doc does not reflect that. 
Also, if there is to be an RDD interface, what is the JIRA for it and what will 
it look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? I'd prefer to get the 
details hashed out further rather than rushing to provide an API and an initial 
slow implementation; that way we can make sure that we get this right in the 
long term. I would really appreciate some clarification, and my apologies if I 
have missed any of the details or discussion.


was (Author: sethah):
I'm coming to this a bit late, but I'm finding things a bit hard to follow. 
Reading the design doc, it seems that the original plan was to implement two 
interfaces: an RDD one that provides the same performance as the current 
{{MultivariateOnlineSummarizer}}, and a DataFrame interface using a UDAF. 

From the design doc:
"...In the meantime, there will be a (possibly faster) RDD interface and a 
(more flexible) Dataframe interface."

Now, the PR for this uses {{TypedImperativeAggregate}}. I understand that the 
approach was pivoted away from UDAF, but the design doc does not reflect that. 
Also, if there is to be an RDD interface, what is the JIRA for it and what will 
it look like?

Also, there are several concerns raised in the design doc about this Catalyst 
aggregate approach being less efficient, and the consensus seemed to be: 
provide an initial API with a "slow" implementation that will be improved upon 
in the future. Is that correct? I'm not that familiar with the Catalyst 
optimizer, but are we sure there is a good way to implement the tree-reduce 
type aggregation, and if so could we document that? If this is still targeted 
at 2.2, why? I'd prefer to get the details hashed out further rather than 
rushing to provide an API and an initial slow implementation; that way we can 
make sure that we get this right in the long term. I would really appreciate 
some clarification, and my apologies if I have missed any of the details or 
discussion.

> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>
>                 Key: SPARK-19634
>                 URL: https://issues.apache.org/jira/browse/SPARK-19634
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Timothy Hunter
>            Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208. Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



