[jira] [Commented] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-26 - Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15438525#comment-15438525 ]

Apache Spark commented on SPARK-17187:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14822

> Support using arbitrary Java object as internal aggregation buffer object
> --------------------------------------------------------------------------
>
>                 Key: SPARK-17187
>                 URL: https://issues.apache.org/jira/browse/SPARK-17187
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Sean Zhong
>            Assignee: Sean Zhong
>             Fix For: 2.1.0
>
>
> *Background*
> For aggregation functions like sum and count, Spark SQL internally uses an 
> aggregation buffer to store the intermediate aggregation results of all 
> aggregation functions. Each aggregation function occupies a section of the 
> aggregation buffer, as the toy sketch below illustrates.
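> The following is a simplified, self-contained sketch (plain Scala, not 
> Spark's actual internal classes) of that layout: two functions share one 
> buffer, and each touches only its own slots:
> {code:scala}
> // Toy model of one aggregation buffer shared by SUM (slot 0) and COUNT
> // (slot 1). In Spark the buffer is a row and each function gets an offset.
> object AggregationBufferLayout {
>   val buffer: Array[Long] = Array(0L, 0L)
>
>   def update(value: Long): Unit = {
>     buffer(0) += value // SUM's section of the buffer
>     buffer(1) += 1L    // COUNT's section of the buffer
>   }
>
>   def main(args: Array[String]): Unit = {
>     Seq(3L, 5L, 7L).foreach(update)
>     println(s"sum = ${buffer(0)}, count = ${buffer(1)}") // sum = 15, count = 3
>   }
> }
> {code}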
> *Problem*
> Currently, Spark SQL only allows a small set of its supported storage data 
> types to be stored in the aggregation buffer, which is neither convenient 
> nor performant. There are several typical cases:
> 1. If the aggregation uses a complex model such as CountMinSketch, it is 
> not easy to convert that model so that it can be stored using the limited 
> set of Spark SQL supported data types.
> 2. It is hard to reuse aggregation class definitions from existing 
> libraries like Algebird.
> 3. It may introduce heavy serialization/deserialization cost when converting 
> a domain model to a Spark SQL supported data type. For example, the current 
> implementation of `TypedAggregateExpression` requires 
> serialization/deserialization for each call of update or merge; the toy 
> sketch below makes this cost concrete.
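> A hypothetical, self-contained illustration of point 3 (this is not 
> `TypedAggregateExpression` itself): when the buffer may only hold a Spark 
> SQL storage type, here a byte array, the domain object must be decoded and 
> re-encoded on every single update:
> {code:scala}
> import java.io._
>
> object PerUpdateSerDeCost {
>   def serialize(s: Set[String]): Array[Byte] = {
>     val bos = new ByteArrayOutputStream()
>     val oos = new ObjectOutputStream(bos)
>     oos.writeObject(s)
>     oos.close()
>     bos.toByteArray
>   }
>
>   def deserialize(bytes: Array[Byte]): Set[String] =
>     new ObjectInputStream(new ByteArrayInputStream(bytes))
>       .readObject().asInstanceOf[Set[String]]
>
>   def main(args: Array[String]): Unit = {
>     // The buffer can only store bytes, so every input row pays a full
>     // deserialize + serialize round trip.
>     var buffer: Array[Byte] = serialize(Set.empty[String])
>     for (word <- Seq("a", "b", "a")) {
>       val model = deserialize(buffer)  // decode the domain object...
>       buffer = serialize(model + word) // ...and re-encode it after the update
>     }
>     println(deserialize(buffer)) // Set(a, b)
>   }
> }
> {code}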
> *Proposal*
> We propose to:
> 1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
> object as the aggregation buffer (sketched after this list), with 
> requirements like:
>  -  It is flexible enough that the API allows using any Java object as the 
> aggregation buffer, so that it is easier to integrate with existing monoid 
> libraries like Algebird.
>  -  We don't need to call serialize/deserialize for each call of 
> update/merge. Instead, only a few serialization/deserialization operations 
> are needed. This is to guarantee the performance.
> 2. Refactor `TypedAggregateExpression` to use this new interface, to get 
> higher performance.
> 3. Implement approximate percentile and other aggregation functions that 
> have a complex aggregation object with this new interface.
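> A minimal sketch of the interface shape this proposal describes (method 
> names follow the text above; the final Spark API may differ in signatures 
> and details):
> {code:scala}
> // The buffer is an arbitrary object of type T. serialize/deserialize are
> // needed only at a few points, e.g. shuffle or spill boundaries, rather
> // than on every update/merge call.
> abstract class TypedImperativeAggregateSketch[T] {
>   def createAggregationBuffer(): T
>   def update(buffer: T, input: Any): T   // fold one input value into the buffer
>   def merge(buffer: T, other: T): T      // combine two partial buffers
>   def eval(buffer: T): Any               // produce the final result
>   def serialize(buffer: T): Array[Byte]  // called only at exchange boundaries
>   def deserialize(bytes: Array[Byte]): T
> }
>
> // Hypothetical example: a distinct count backed by a plain mutable Set,
> // the kind of object-valued buffer the proposal enables.
> import scala.collection.mutable
>
> class DistinctCount extends TypedImperativeAggregateSketch[mutable.Set[String]] {
>   def createAggregationBuffer(): mutable.Set[String] = mutable.Set.empty
>
>   def update(buffer: mutable.Set[String], input: Any): mutable.Set[String] =
>     buffer += input.toString
>
>   def merge(buffer: mutable.Set[String], other: mutable.Set[String]): mutable.Set[String] =
>     buffer ++= other
>
>   def eval(buffer: mutable.Set[String]): Any = buffer.size
>
>   // Toy encoding: assumes values contain no commas.
>   def serialize(buffer: mutable.Set[String]): Array[Byte] =
>     buffer.mkString(",").getBytes("UTF-8")
>
>   def deserialize(bytes: Array[Byte]): mutable.Set[String] = {
>     val s = new String(bytes, "UTF-8")
>     if (s.isEmpty) mutable.Set.empty else mutable.Set(s.split(","): _*)
>   }
> }
> {code}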






[jira] [Commented] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-22 - Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431034#comment-15431034 ]

Apache Spark commented on SPARK-17187:
--------------------------------------

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14753




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org