[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17187:
-
Assignee: Sean Zhong

> Support using arbitrary Java object as internal aggregation buffer object
> -
>
> Key: SPARK-17187
> URL: https://issues.apache.org/jira/browse/SPARK-17187
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>
> *Background*
> For aggregation functions like sum and count, Spark SQL internally uses an 
> aggregation buffer to store the intermediate aggregation results of all 
> aggregation functions. Each aggregation function occupies a section of the 
> aggregation buffer.
> *Problem*
> Currently, Spark SQL only allows a small set of its supported storage data 
> types to be stored in the aggregation buffer, which is neither convenient nor 
> performant. There are several typical cases:
> 1. If the aggregation uses a complex model such as CountMinSketch, it is not 
> easy to convert the model so that it can be stored with the limited set of 
> Spark SQL supported data types.
> 2. It is hard to reuse aggregation class definitions from existing libraries 
> like Algebird.
> 3. It may introduce heavy serialization/deserialization cost when converting 
> a domain model to a Spark SQL supported data type. For example, the current 
> implementation of `TypedAggregateExpression` requires 
> serialization/deserialization for each call of update or merge.
> *Proposal*
> We propose to:
> 1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
> object as the aggregation buffer (a rough sketch of such an interface follows 
> below), with requirements like:
>  -  The API is flexible enough to allow any Java object as the aggregation 
> buffer, so that it is easy to integrate with existing Monoid libraries like 
> Algebird.
>  -  Serialize/deserialize does not need to be called for each update/merge; 
> only a few serialization/deserialization operations are needed, which 
> guarantees the performance.
> 2. Refactor `TypedAggregateExpression` to use this new interface, to get 
> higher performance.
> 3. Implement approximate percentile and other aggregation functions that have 
> a complex aggregation object with this new interface.
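As a rough illustration of the proposal, the sketch below shows one possible 
shape for such an interface. The names and signatures are assumptions inferred 
from the description above, not the final Spark API.

{code}
// Illustrative sketch only: names and signatures are assumptions based on the
// proposal above, not the final Spark API.
abstract class TypedImperativeAggregate[T] {
  // Creates an empty aggregation buffer, which can be an arbitrary Java/Scala object.
  def createAggregationBuffer(): T

  // Folds one input value into the in-memory buffer object; no serialization here.
  def update(buffer: T, inputValue: Any): T

  // Merges a buffer produced by another partition into this one.
  def merge(buffer: T, other: T): T

  // Produces the final result from the buffer.
  def eval(buffer: T): Any

  // Serialization only happens at partition boundaries (e.g. before a shuffle),
  // not on every update/merge call.
  def serialize(buffer: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T
}
{code}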





[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17187:
---
Description: 
*Background*

For aggregation functions like sum and count, Spark SQL internally uses an 
aggregation buffer to store the intermediate aggregation results of all 
aggregation functions. Each aggregation function occupies a section of the 
aggregation buffer.

*Problem*

Currently, Spark SQL only allows a small set of its supported storage data 
types to be stored in the aggregation buffer, which is neither convenient nor 
performant. There are several typical cases:

1. If the aggregation uses a complex model such as CountMinSketch, it is not 
easy to convert the model so that it can be stored with the limited set of 
Spark SQL supported data types.

2. It is hard to reuse aggregation class definitions from existing libraries 
like Algebird.

3. It may introduce heavy serialization/deserialization cost when converting a 
domain model to a Spark SQL supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/deserialization for each call of update or merge.

*Proposal*

We propose to:

1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
object as the aggregation buffer, with requirements like:
 -  The API is flexible enough to allow any Java object as the aggregation 
buffer, so that it is easy to integrate with existing Monoid libraries like 
Algebird.
 -  Serialize/deserialize does not need to be called for each update/merge; 
only a few serialization/deserialization operations are needed, which 
guarantees the performance.
2. Refactor `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implement approximate percentile and other aggregation functions that have a 
complex aggregation object with this new interface. (An illustrative example of 
such an object-buffered aggregate follows below.)
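As a hypothetical example (not actual Spark code), an aggregate that counts 
distinct values by keeping a mutable java.util.HashSet as its buffer could be 
written roughly as follows, with serialization performed only when the buffer 
crosses a partition boundary. The class and method names mirror the interface 
shape sketched earlier in this thread and are assumptions for illustration.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.HashSet

// Hypothetical example; method names follow the interface shape sketched
// earlier in this thread, not an actual Spark class.
class DistinctCountAgg {

  // The buffer is a plain mutable Java object; no conversion to Spark SQL types.
  def createAggregationBuffer(): HashSet[Any] = new HashSet[Any]()

  // update/merge mutate the in-memory object directly, with no ser/de cost.
  def update(buffer: HashSet[Any], inputValue: Any): HashSet[Any] = {
    buffer.add(inputValue)
    buffer
  }

  def merge(buffer: HashSet[Any], other: HashSet[Any]): HashSet[Any] = {
    buffer.addAll(other)
    buffer
  }

  // The final result is just the number of distinct values seen.
  def eval(buffer: HashSet[Any]): Any = buffer.size()

  // Ser/de only happens when a buffer is shipped across partitions,
  // not on every update/merge call.
  def serialize(buffer: HashSet[Any]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(buffer)
    out.close()
    bytes.toByteArray
  }

  def deserialize(data: Array[Byte]): HashSet[Any] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    in.readObject().asInstanceOf[HashSet[Any]]
  }
}
{code}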

  was:
*Background*

For aggregation functions like sum and count, Spark SQL internally uses an 
aggregation buffer to store the intermediate aggregation results of all 
aggregation functions. Each aggregation function occupies a section of the 
aggregation buffer.

*Problem*

Currently, Spark SQL only allows a small set of its supported storage data 
types to be stored in the aggregation buffer, which is neither convenient nor 
performant. There are several typical cases:

1. If the aggregation uses a complex model such as CountMinSketch, it is not 
easy to convert the model so that it can be stored with the limited set of 
Spark SQL supported data types.
2. It is hard to reuse aggregation class definitions from existing libraries 
like Algebird.
3. It may introduce heavy serialization/deserialization cost when converting a 
domain model to a Spark SQL supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/deserialization for each call of update or merge.

*Proposal*

We propose to:
1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
object as the aggregation buffer, with requirements like:
a. The API is flexible enough to allow any Java object as the aggregation 
buffer, so that it is easy to integrate with existing Monoid libraries like 
Algebird.
b. Serialize/deserialize does not need to be called for each update/merge; 
only a few serialization/deserialization operations are needed, which 
guarantees the performance.
2. Refactor `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implement approximate percentile and other aggregation functions that have a 
complex aggregation object with this new interface.


> Support using arbitrary Java object as internal aggregation buffer object
> -
>
> Key: SPARK-17187
> URL: https://issues.apache.org/jira/browse/SPARK-17187
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sean Zhong
>
> *Background*
> For aggregation functions like sum and count, Spark SQL internally uses an 
> aggregation buffer to store the intermediate aggregation results of all 
> aggregation functions. Each aggregation function occupies a section of the 
> aggregation buffer.
> *Problem*
> Currently, Spark SQL only allows a small set of its supported storage data 
> types to be stored in the aggregation buffer, which is neither convenient nor 
> performant. There are several typical cases:
> 1. If the aggregation uses a complex model such as CountMinSketch, it is not 
> easy to convert the model so that it can be stored with the limited set of 
> Spark SQL supported data types.
> 2. It is hard to reuse aggregation class definitions from existing libraries 
> like Algebird.
> 3. It may introduce heavy serialization/deserialization cost when converting 
> a domain model to a Spark SQL supported data type.

[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17187:
---
Description: 
*Background*

For aggregation functions like sum and count, Spark SQL internally uses an 
aggregation buffer to store the intermediate aggregation results of all 
aggregation functions. Each aggregation function occupies a section of the 
aggregation buffer.

*Problem*

Currently, Spark SQL only allows a small set of its supported storage data 
types to be stored in the aggregation buffer, which is neither convenient nor 
performant. There are several typical cases:

1. If the aggregation uses a complex model such as CountMinSketch, it is not 
easy to convert the model so that it can be stored with the limited set of 
Spark SQL supported data types.

2. It is hard to reuse aggregation class definitions from existing libraries 
like Algebird.

3. It may introduce heavy serialization/deserialization cost when converting a 
domain model to a Spark SQL supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/deserialization for each call of update or merge.

*Proposal*

We propose to:

1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
object as the aggregation buffer, with requirements like:
 -  The API is flexible enough to allow any Java object as the aggregation 
buffer, so that it is easier to integrate with existing Monoid libraries like 
Algebird (a rough adapter sketch follows below).
 -  Serialize/deserialize does not need to be called for each update/merge; 
only a few serialization/deserialization operations are needed, which 
guarantees the performance.

2. Refactor `TypedAggregateExpression` to use this new interface, to get 
higher performance.

3. Implement approximate percentile and other aggregation functions that have a 
complex aggregation object with this new interface.
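To illustrate the Algebird integration point, a rough adapter from an Algebird 
Monoid to the interface shape sketched earlier in this thread might look as 
follows. The adapter class, the toBuffer parameter, and the use of Java 
serialization are assumptions for illustration only, not part of the proposal.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import com.twitter.algebird.Monoid

// Hypothetical adapter: wraps an Algebird Monoid so that its value type can
// serve directly as the aggregation buffer object. Java serialization is used
// only for illustration; a real implementation would pick a faster codec.
class MonoidAggregate[B <: java.io.Serializable](monoid: Monoid[B], toBuffer: Any => B) {

  // The empty buffer is simply the monoid's identity element.
  def createAggregationBuffer(): B = monoid.zero

  // Combine one input value into the buffer with the monoid's plus.
  def update(buffer: B, inputValue: Any): B = monoid.plus(buffer, toBuffer(inputValue))

  // Merge two partial buffers, again via plus.
  def merge(buffer: B, other: B): B = monoid.plus(buffer, other)

  def eval(buffer: B): Any = buffer

  // Ser/de is only needed at partition boundaries.
  def serialize(buffer: B): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(buffer)
    out.close()
    bytes.toByteArray
  }

  def deserialize(data: Array[Byte]): B = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    in.readObject().asInstanceOf[B]
  }
}
{code}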

  was:
*Background*

For aggregation functions like sum and count, Spark SQL internally uses an 
aggregation buffer to store the intermediate aggregation results of all 
aggregation functions. Each aggregation function occupies a section of the 
aggregation buffer.

*Problem*

Currently, Spark SQL only allows a small set of its supported storage data 
types to be stored in the aggregation buffer, which is neither convenient nor 
performant. There are several typical cases:

1. If the aggregation uses a complex model such as CountMinSketch, it is not 
easy to convert the model so that it can be stored with the limited set of 
Spark SQL supported data types.

2. It is hard to reuse aggregation class definitions from existing libraries 
like Algebird.

3. It may introduce heavy serialization/deserialization cost when converting a 
domain model to a Spark SQL supported data type. For example, the current 
implementation of `TypedAggregateExpression` requires 
serialization/deserialization for each call of update or merge.

*Proposal*

We propose to:

1. Introduce a TypedImperativeAggregate which allows using an arbitrary Java 
object as the aggregation buffer, with requirements like:
 -  The API is flexible enough to allow any Java object as the aggregation 
buffer, so that it is easy to integrate with existing Monoid libraries like 
Algebird.
 -  Serialize/deserialize does not need to be called for each update/merge; 
only a few serialization/deserialization operations are needed, which 
guarantees the performance.
2. Refactor `TypedAggregateExpression` to use this new interface, to get 
higher performance.
3. Implement approximate percentile and other aggregation functions that have a 
complex aggregation object with this new interface.


> Support using arbitrary Java object as internal aggregation buffer object
> -
>
> Key: SPARK-17187
> URL: https://issues.apache.org/jira/browse/SPARK-17187
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Sean Zhong
>
> *Background*
> For aggregation functions like sum and count, Spark SQL internally uses an 
> aggregation buffer to store the intermediate aggregation results of all 
> aggregation functions. Each aggregation function occupies a section of the 
> aggregation buffer.
> *Problem*
> Currently, Spark SQL only allows a small set of its supported storage data 
> types to be stored in the aggregation buffer, which is neither convenient nor 
> performant. There are several typical cases:
> 1. If the aggregation uses a complex model such as CountMinSketch, it is not 
> easy to convert the model so that it can be stored with the limited set of 
> Spark SQL supported data types.
> 2. It is hard to reuse aggregation class definitions from existing libraries 
> like Algebird.
> 3. It may introduce heavy serialization/deserialization cost when converting 
> a domain model to a Spark SQL supported data type.