[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-06 Thread Alex Levenson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726986#comment-15726986
 ] 

Alex Levenson commented on SPARK-18728:
---

I think my comment above lists some concrete benefits. Algebird is a very light
dependency, and if you see anything wrong with its (small) set of transitive
dependencies, I think we'd be open to figuring out how to fix those sorts of
issues.

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
> Issue Type: Improvement
> Reporter: Alex Levenson
> Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird"
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726647#comment-15726647
 ] 

Sean Owen commented on SPARK-18728:
---

It's pretty much what I mentioned there: another dependency in the tree. That's
not a deal-breaker; it's just the question. Note that we wouldn't be able to
include third-party types in a public API.




[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-06 Thread Mansur Ashraf (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726142#comment-15726142
 ] 

Mansur Ashraf commented on SPARK-18728:
---

Hi Sean,

The Dataset API has removed `aggregateByKey` and its variants and only allows
passing up to 4 aggregators via ds.select(...), which is a downgrade in user
experience from RDD. What we gain by making this change is that there is no
arbitrary limit on the number of custom aggregators a user can pass. We are
already using CountMinSketch, QTrees, TopK, and HLL++ from Algebird with Spark,
in addition to other aggregators that are built in-house.

Since Spark aggregators are inspired by Algebird aggregators, per the comment
in the code, why not just use Algebird aggregators instead of copying the
trait? The Dataset API in its current form is not usable for us at Apple Inc.
due to the limitation I listed above.




[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Alex Levenson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723866#comment-15723866
 ] 

Alex Levenson commented on SPARK-18728:
---

I think the main selling points of Algebird aggregators are:

1) They are composable (you can take a Min aggregator and combine it with a Max 
aggregator to get an aggregator that gets both the Min + Max in 1 pass) -- as 
[~mashraf] points out, you can compose many times to get lots of aggregations 
in 1 pass

2) They have the option for efficient addition methods -- they use algebird's 
Semigroup, which has both plus(a,b) for adding 2 items, and sumOption(iter: 
TraversableOnce[T]) for adding N items. This allows for opting in to efficient 
additions without having a mutable API (sumOption can be mutable internally, 
but it has to be referentially transparent)

3) There are many already-built implementations of Aggregator available in
algebird, both for common types and for probabilistic data structures.
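
A minimal sketch of points 1 and 2, in plain Scala with no Algebird dependency.
The names SimpleSemigroup, SimpleAggregator, Aggregators, ConcatSemigroup and
Demo are hypothetical stand-ins written for this illustration; they mirror the
general shape of Algebird's Semigroup and Aggregator (prepare, semigroup,
present, join) but are simplified and are not the library's actual types.

```scala
// Simplified stand-in for a semigroup: plus for 2 items, sumOption for N items.
trait SimpleSemigroup[T] {
  def plus(a: T, b: T): T
  // Default: reduce with plus. Implementations may override this with a
  // faster version that is mutable internally but pure from the outside.
  def sumOption(items: Iterable[T]): Option[T] =
    items.reduceOption((a: T, b: T) => plus(a, b))
}

// Simplified stand-in for an aggregator: prepare each input, combine, present.
trait SimpleAggregator[A, B, C] { self =>
  def prepare(input: A): B
  def semigroup: SimpleSemigroup[B]
  def present(reduced: B): C

  // Run the aggregation over a collection; None for empty input.
  def apply(inputs: Iterable[A]): Option[C] =
    semigroup.sumOption(inputs.map(a => prepare(a))).map(b => present(b))

  // Compose two aggregators into one that computes both results together.
  def join[B2, C2](that: SimpleAggregator[A, B2, C2]): SimpleAggregator[A, (B, B2), (C, C2)] =
    new SimpleAggregator[A, (B, B2), (C, C2)] {
      def prepare(input: A): (B, B2) = (self.prepare(input), that.prepare(input))
      val semigroup: SimpleSemigroup[(B, B2)] = new SimpleSemigroup[(B, B2)] {
        def plus(x: (B, B2), y: (B, B2)): (B, B2) =
          (self.semigroup.plus(x._1, y._1), that.semigroup.plus(x._2, y._2))
      }
      def present(r: (B, B2)): (C, C2) = (self.present(r._1), that.present(r._2))
    }
}

object Aggregators {
  // Build an aggregator from a plain associative reduce function.
  def fromReduce[T](f: (T, T) => T): SimpleAggregator[T, T, T] =
    new SimpleAggregator[T, T, T] {
      def prepare(input: T): T = input
      val semigroup: SimpleSemigroup[T] = new SimpleSemigroup[T] {
        def plus(a: T, b: T): T = f(a, b)
      }
      def present(reduced: T): T = reduced
    }
}

// A semigroup that opts in to an efficient N-item sum via a local StringBuilder.
object ConcatSemigroup extends SimpleSemigroup[String] {
  def plus(a: String, b: String): String = a + b
  override def sumOption(items: Iterable[String]): Option[String] =
    if (items.isEmpty) None
    else {
      val sb = new StringBuilder
      items.foreach(s => sb.append(s))
      Some(sb.toString)
    }
}

object Demo {
  // Min joined with Max: both results from one aggregation over the input.
  val minMax: SimpleAggregator[Int, (Int, Int), (Int, Int)] =
    Aggregators.fromReduce[Int]((a, b) => math.min(a, b))
      .join(Aggregators.fromReduce[Int]((a, b) => math.max(a, b)))
}
```

Here Demo.minMax(List(3, 1, 4, 1, 5)) yields Some((1, 5)). Because join returns
another aggregator, it can be applied repeatedly to stack many aggregations.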




[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Mansur Ashraf (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723786#comment-15723786
 ] 

Mansur Ashraf commented on SPARK-18728:
---

Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have tons of jobs on Spark 1.6 that use Algebird aggregators through the
`aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators
are composable (meaning you can combine any number of aggregators into 1
combined aggregator), our jobs combine 10+ aggregators and do single-pass
aggregations using aggregateByKey/combineByKey. As we upgrade to Spark 2.0.0
and the new Dataset API
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset),
we find that aggregateByKey/combineByKey are all gone, so we can't pass
Algebird aggregators directly. Instead, there is a new aggregator API based on
Algebird which (as far as I can tell) does not allow joining multiple
aggregators and limits the number of aggregators to 4.
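
To illustrate the single-pass pattern described above, here is a rough sketch
in plain Scala that models the shape of RDD's aggregateByKey (a zero value, a
per-value seqOp, and a cross-partition combOp). It does not use Spark or
Algebird: partitions are modelled as nested Seqs, and AggregateByKeySketch,
Stats, addValue and mergeStats are hypothetical names made up for this example.
The combined accumulator plays the role of several aggregators joined into one.

```scala
object AggregateByKeySketch {
  // Hypothetical combined accumulator: min, max and count tracked together.
  final case class Stats(min: Int, max: Int, count: Long)

  // Identity element: merging it with any Stats leaves that Stats unchanged.
  val zero: Stats = Stats(Int.MaxValue, Int.MinValue, 0L)

  // seqOp: fold one value into the accumulator within a partition.
  def addValue(s: Stats, x: Int): Stats =
    Stats(math.min(s.min, x), math.max(s.max, x), s.count + 1)

  // combOp: merge accumulators produced by different partitions.
  def mergeStats(a: Stats, b: Stats): Stats =
    Stats(math.min(a.min, b.min), math.max(a.max, b.max), a.count + b.count)

  // Plain-collections model of aggregateByKey; each inner Seq is a "partition".
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Step 1: one pass over each partition, updating a per-key accumulator.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (m, (k, v)) =>
        m.updated(k, seqOp(m.getOrElse(k, zeroValue), v))
      }
    }
    // Step 2: merge the per-partition accumulators key by key.
    perPartition.foldLeft(Map.empty[K, U]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 3, "b" -> 7, "a" -> 1),
    Seq("a" -> 9, "b" -> 2)
  )

  // All three statistics are computed in a single pass over the values.
  val result: Map[String, Stats] =
    aggregateByKey(partitions, zero)(addValue, mergeStats)
}
```

With the sample data, key "a" ends up as Stats(1, 9, 3) and key "b" as
Stats(2, 7, 2), each value having been visited exactly once.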

It would be really nice if Spark used Algebird aggregators instead of creating
its own, or allowed users to pass Algebird aggregators in the Dataset API in
addition to Spark aggregators.

Thanks
