[jira] [Closed] (FLINK-12744) ML common parameters

2019-07-05 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-12744.
-
   Resolution: Fixed
Fix Version/s: 1.9.0

> ML common parameters
> 
>
> Key: FLINK-12744
> URL: https://issues.apache.org/jira/browse/FLINK-12744
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Xu Yang
>Assignee: Xu Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We defined some commonly used parameters for machine-learning algorithms (a 
> sketch of one such shared parameter follows this list).
>  - *add ML common parameters*
>  - *change behavior when using the default constructor of the param factory*
>  - *add shared params in ml package*
>  - *add flink-ml module*
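
Below is a minimal sketch of the shared-parameter style this change introduces. It 
assumes the ParamInfo/ParamInfoFactory/WithParams classes from flink-ml-api; the 
exact builder methods may differ from the merged code, and HasLearningRate is an 
illustrative name, not one of the parameters added by this issue.

{code:java}
import org.apache.flink.ml.api.misc.param.ParamInfo;
import org.apache.flink.ml.api.misc.param.ParamInfoFactory;
import org.apache.flink.ml.api.misc.param.WithParams;

// A shared parameter interface that any algorithm can mix in to expose
// a uniformly named, self-describing "learningRate" parameter.
public interface HasLearningRate<T> extends WithParams<T> {

    ParamInfo<Double> LEARNING_RATE = ParamInfoFactory
        .createParamInfo("learningRate", Double.class)
        .setDescription("learning rate of the optimizer")
        .setHasDefaultValue(0.1)
        .build();

    default double getLearningRate() {
        return get(LEARNING_RATE);
    }

    default T setLearningRate(double value) {
        return set(LEARNING_RATE, value);
    }
}
{code}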



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12744) ML common parameters

2019-07-05 Thread Shaoxuan Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879343#comment-16879343
 ] 

Shaoxuan Wang commented on FLINK-12744:
---

Fixed in master: f18481ee54b61a737ae1426ff4dcaf7006e0edbd

> ML common parameters
> 
>
> Key: FLINK-12744
> URL: https://issues.apache.org/jira/browse/FLINK-12744
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Xu Yang
>Assignee: Xu Yang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We defined some commonly used parameters for machine-learning algorithms.
>  - *add ML common parameters*
>  - *change behavior when using the default constructor of the param factory*
>  - *add shared params in ml package*
>  - *add flink-ml module*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12758) Add flink-ml-lib module

2019-07-05 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-12758.
-
Resolution: Fixed

> Add flink-ml-lib module
> ---
>
> Key: FLINK-12758
> URL: https://issues.apache.org/jira/browse/FLINK-12758
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Luo Gen
>Assignee: Luo Gen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira introduces a new module "flink-ml-lib" under flink-ml-parent.
> The flink-ml-lib module is planned in the FLIP-39 roadmap as the code base for 
> library implementations of FlinkML. This Jira only aims to create the module; 
> algorithms will be added in separate Jiras in the future.
>  For more details, please refer to the [FLIP39 design 
> doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12758) Add flink-ml-lib module

2019-07-05 Thread Shaoxuan Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879055#comment-16879055
 ] 

Shaoxuan Wang commented on FLINK-12758:
---

Fixed in master: 726d9e49905e893659a1f7b0ba83a0a59bec8fac

> Add flink-ml-lib module
> ---
>
> Key: FLINK-12758
> URL: https://issues.apache.org/jira/browse/FLINK-12758
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Luo Gen
>Assignee: Luo Gen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira introduces a new module "flink-ml-lib" under flink-ml-parent.
> The flink-ml-lib module is planned in the FLIP-39 roadmap as the code base for 
> library implementations of FlinkML. This Jira only aims to create the module; 
> algorithms will be added in separate Jiras in the future.
>  For more details, please refer to the [FLIP39 design 
> doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12597) Remove the legacy flink-libraries/flink-ml

2019-07-05 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-12597.
-
   Resolution: Fixed
Fix Version/s: 1.9.0

> Remove the legacy flink-libraries/flink-ml
> --
>
> Key: FLINK-12597
> URL: https://issues.apache.org/jira/browse/FLINK-12597
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Affects Versions: 1.9.0
>Reporter: Shaoxuan Wang
>Assignee: Luo Gen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> As discussed on the dev mailing list, we decided to delete the legacy 
> flink-libraries/flink-ml, as well as flink-libraries/flink-ml-uber. No further 
> development is planned for this legacy flink-ml package in 1.9 or beyond. Users 
> can keep using the 1.8 version if their products/projects still rely on this 
> package.
> [1] 
> [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12597) Remove the legacy flink-libraries/flink-ml

2019-07-05 Thread Shaoxuan Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879052#comment-16879052
 ] 

Shaoxuan Wang commented on FLINK-12597:
---

Fixed in master: 2b2a83df56242aa90ee731f25d17b050b75df0f3

> Remove the legacy flink-libraries/flink-ml
> --
>
> Key: FLINK-12597
> URL: https://issues.apache.org/jira/browse/FLINK-12597
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Affects Versions: 1.9.0
>Reporter: Shaoxuan Wang
>Assignee: Luo Gen
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> As discussed on the dev mailing list, we decided to delete the legacy 
> flink-libraries/flink-ml, as well as flink-libraries/flink-ml-uber. No further 
> development is planned for this legacy flink-ml package in 1.9 or beyond. Users 
> can keep using the 1.8 version if their products/projects still rely on this 
> package.
> [1] 
> [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12881) Add more functionalities for ML Params and ParamInfo class

2019-06-30 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-12881.
-
   Resolution: Fixed
Fix Version/s: 1.9.0

> Add more functionalities for ML Params and ParamInfo class
> --
>
> Key: FLINK-12881
> URL: https://issues.apache.org/jira/browse/FLINK-12881
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Xu Yang
>Assignee: Xu Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> 1. Change Params$get to support aliases:
>  * support aliases
>  * if the Params contains the specific parameter under both its name and an 
> alias, and more than one value is found, it will throw an exception
>  * if the Params doesn't contain the specific parameter, and the ParamInfo is 
> optional but has no default value, it will throw an exception
>  * when isOptional is true and contain(ParamInfo) is true, it returns the value 
> found in Params, whether that value is null or not. When isOptional is true and 
> contain(ParamInfo) is false: if hasDefaultValue is true, it returns the 
> defaultValue; if hasDefaultValue is false, it throws an exception. Developers 
> should use contain to check whether the Params has the ParamInfo or not. (A 
> sketch of these lookup rules follows below.)
> 2. Add size, clear, isEmpty, contains, fromJson to Params.
> 3. Fix ParamInfo and PipelineStage to adapt to the new Params:
>  * assert null in alias
>  * use Params$loadJson in PipelineStage
> 4. Add test cases for aliases.
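
The lookup rules above can be summarized in a few lines of code. This is a 
hypothetical, self-contained illustration, not the actual Params implementation: 
"name" and "aliases" stand in for ParamInfo fields, and "values" for the map that 
Params keeps internally.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class ParamLookupSketch {

    static <V> V resolve(Map<String, V> values, String name, String[] aliases,
                         boolean hasDefaultValue, V defaultValue) {
        int hits = 0;
        V found = null;
        if (values.containsKey(name)) {
            hits++;
            found = values.get(name);
        }
        for (String alias : aliases) {
            if (values.containsKey(alias)) {
                hits++;
                found = values.get(alias);
            }
        }
        if (hits > 1) {
            // set under both the name and an alias -> ambiguous, throw
            throw new IllegalStateException("Parameter '" + name + "' has more than one value");
        }
        if (hits == 1) {
            return found; // returned even if the stored value is null
        }
        if (hasDefaultValue) {
            return defaultValue; // optional parameter falls back to its default
        }
        // no value and no default value -> exception
        throw new IllegalArgumentException("No value for parameter '" + name + "'");
    }

    public static void main(String[] args) {
        Map<String, Integer> values = new HashMap<>();
        values.put("maxIter", 50);
        // found under the canonical name; the alias list is only consulted for extra hits
        System.out.println(resolve(values, "maxIter", new String[]{"numIter"}, true, 10)); // 50
        // not set at all, so the default value is returned
        System.out.println(resolve(values, "k", new String[]{}, true, 3));                 // 3
    }
}
{code}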



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12881) Add more functionalities for ML Params and ParamInfo class

2019-06-30 Thread Shaoxuan Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875719#comment-16875719
 ] 

Shaoxuan Wang commented on FLINK-12881:
---

Fixed in master: 1660c6b47af789fa5c9bf6a3ff77e868ca90

> Add more functionalities for ML Params and ParamInfo class
> --
>
> Key: FLINK-12881
> URL: https://issues.apache.org/jira/browse/FLINK-12881
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Xu Yang
>Assignee: Xu Yang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> 1. Change Params$get to support aliases:
>  * support aliases
>  * if the Params contains the specific parameter under both its name and an 
> alias, and more than one value is found, it will throw an exception
>  * if the Params doesn't contain the specific parameter, and the ParamInfo is 
> optional but has no default value, it will throw an exception
>  * when isOptional is true and contain(ParamInfo) is true, it returns the value 
> found in Params, whether that value is null or not. When isOptional is true and 
> contain(ParamInfo) is false: if hasDefaultValue is true, it returns the 
> defaultValue; if hasDefaultValue is false, it throws an exception. Developers 
> should use contain to check whether the Params has the ParamInfo or not.
> 2. Add size, clear, isEmpty, contains, fromJson to Params.
> 3. Fix ParamInfo and PipelineStage to adapt to the new Params:
>  * assert null in alias
>  * use Params$loadJson in PipelineStage
> 4. Add test cases for aliases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (FLINK-12470) FLIP39: Flink ML pipeline and ML libs

2019-06-30 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-12470:
--
Comment: was deleted

(was: Fixed in master:1660c6b47af789fa5c9bf6a3ff77e868ca90)

> FLIP39: Flink ML pipeline and ML libs
> -
>
> Key: FLINK-12470
> URL: https://issues.apache.org/jira/browse/FLINK-12470
> Project: Flink
>  Issue Type: New Feature
>  Components: Library / Machine Learning
>Affects Versions: 1.9.0
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>Priority: Major
> Fix For: 1.9.0
>
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> This is the umbrella Jira for FLIP39, which intends to enhance the 
> scalability and the ease of use of Flink ML. 
> ML Discussion thread: 
> [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html]
> Google Doc: (will convert it to an official confluence page very soon) 
> [https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit]
> In machine learning, there are mainly two types of people. The first type is 
> MLlib developers. They need a set of standard, well-abstracted core ML APIs to 
> implement the algorithms. Every ML algorithm is a concrete implementation on 
> top of these APIs. The second type is MLlib users, who utilize the 
> existing/packaged MLlib to train or serve a model. It is pretty common that the 
> entire training or inference is constructed from a sequence of transformations 
> or algorithms. It is essential to provide a workflow/pipeline API for MLlib 
> users such that they can easily combine multiple algorithms to describe the ML 
> workflow/pipeline.
> Flink currently has a set of ML core interfaces, but they are built on top of 
> the DataSet API. This does not quite align with the latest Flink 
> [roadmap|https://flink.apache.org/roadmap.html] (the Table API will become the 
> first-class citizen and primary API for analytics use cases, while the DataSet 
> API will be gradually deprecated). Moreover, Flink at present does not have 
> any interface that allows MLlib users to describe an ML workflow/pipeline, 
> nor does it provide any approach to persist pipelines or models and reuse them 
> in the future. To solve/improve these issues, in this FLIP we propose to:
>  * Provide a new set of ML core interfaces (on top of the Flink Table API)
>  * Provide an ML pipeline interface (on top of the Flink Table API)
>  * Provide the interfaces for parameter management and pipeline persistence
>  * All the above interfaces should facilitate any new ML algorithm. We will 
> gradually add various standard ML algorithms on top of these newly proposed 
> interfaces to ensure their feasibility and scalability. (A sketch of the core 
> interfaces follows below.)
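
To make the proposal concrete, here is a hedged sketch of what Table-API-based 
core interfaces along these lines could look like. The real FLIP-39 interfaces 
also carry parameter handling (via Params/WithParams) and generics for fluent 
typing, which are omitted here for brevity.

{code:java}
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

// A Transformer maps an input Table to an output Table
// (feature engineering or model inference).
interface Transformer {
    Table transform(TableEnvironment tEnv, Table input);
}

// A Model is simply a Transformer that was produced by training.
interface Model extends Transformer {
}

// An Estimator trains on an input Table and returns a fitted Model.
interface Estimator<M extends Model> {
    M fit(TableEnvironment tEnv, Table input);
}
{code}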



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12470) FLIP39: Flink ML pipeline and ML libs

2019-06-30 Thread Shaoxuan Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875717#comment-16875717
 ] 

Shaoxuan Wang commented on FLINK-12470:
---

Fixed in master: 1660c6b47af789fa5c9bf6a3ff77e868ca90

> FLIP39: Flink ML pipeline and ML libs
> -
>
> Key: FLINK-12470
> URL: https://issues.apache.org/jira/browse/FLINK-12470
> Project: Flink
>  Issue Type: New Feature
>  Components: Library / Machine Learning
>Affects Versions: 1.9.0
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>Priority: Major
> Fix For: 1.9.0
>
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> This is the umbrella Jira for FLIP39, which intends to enhance the 
> scalability and the ease of use of Flink ML. 
> ML Discussion thread: 
> [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html]
> Google Doc: (will convert it to an official confluence page very soon) 
> [https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit]
> In machine learning, there are mainly two types of people. The first type is 
> MLlib developers. They need a set of standard, well-abstracted core ML APIs to 
> implement the algorithms. Every ML algorithm is a concrete implementation on 
> top of these APIs. The second type is MLlib users, who utilize the 
> existing/packaged MLlib to train or serve a model. It is pretty common that the 
> entire training or inference is constructed from a sequence of transformations 
> or algorithms. It is essential to provide a workflow/pipeline API for MLlib 
> users such that they can easily combine multiple algorithms to describe the ML 
> workflow/pipeline.
> Flink currently has a set of ML core interfaces, but they are built on top of 
> the DataSet API. This does not quite align with the latest Flink 
> [roadmap|https://flink.apache.org/roadmap.html] (the Table API will become the 
> first-class citizen and primary API for analytics use cases, while the DataSet 
> API will be gradually deprecated). Moreover, Flink at present does not have 
> any interface that allows MLlib users to describe an ML workflow/pipeline, 
> nor does it provide any approach to persist pipelines or models and reuse them 
> in the future. To solve/improve these issues, in this FLIP we propose to:
>  * Provide a new set of ML core interfaces (on top of the Flink Table API)
>  * Provide an ML pipeline interface (on top of the Flink Table API)
>  * Provide the interfaces for parameter management and pipeline persistence
>  * All the above interfaces should facilitate any new ML algorithm. We will 
> gradually add various standard ML algorithms on top of these newly proposed 
> interfaces to ensure their feasibility and scalability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12473) Add the interface of ML pipeline and ML lib

2019-05-24 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-12473.
-
   Resolution: Fixed
Fix Version/s: 1.9.0

Fixed in master: 305095743ffe0bc39f76c1bda983da7d0df9e003

> Add the interface of ML pipeline and ML lib
> ---
>
> Key: FLINK-12473
> URL: https://issues.apache.org/jira/browse/FLINK-12473
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Shaoxuan Wang
>Assignee: Luo Gen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: image-2019-05-10-12-50-15-869.png
>
>   Original Estimate: 168h
>  Time Spent: 0.5h
>  Remaining Estimate: 167.5h
>
> This Jira will introduce the major interfaces for ML pipeline and ML lib.
> The major interfaces and their relationship diagram are shown below; a short 
> usage sketch follows after this description. For more details, please refer to 
> the [FLIP39 design 
> doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo]
>  
> !image-2019-05-10-12-50-15-869.png!
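
As a hedged usage sketch of the interfaces this issue introduces: the method names 
(appendStage, fit, transform) follow the FLIP-39 design, while MyScaler and 
MyClassifier are hypothetical stages that are not part of this change, and the 
exact types may differ from the merged code.

{code:java}
// Build a pipeline from Transformer and Estimator stages.
Pipeline pipeline = new Pipeline()
    .appendStage(new MyScaler())        // a Transformer stage (hypothetical)
    .appendStage(new MyClassifier());   // an Estimator stage (hypothetical)

// fit() trains the Estimator stages and yields a fitted pipeline usable as a Model.
Pipeline fitted = pipeline.fit(tEnv, trainingTable);

// transform() applies all fitted stages to new data and returns a result Table.
Table predictions = fitted.transform(tEnv, inputTable);
{code}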



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12597) Remove the legacy flink-libraries/flink-ml

2019-05-23 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-12597:
-

 Summary: Remove the legacy flink-libraries/flink-ml
 Key: FLINK-12597
 URL: https://issues.apache.org/jira/browse/FLINK-12597
 Project: Flink
  Issue Type: Sub-task
  Components: Library / Machine Learning
Affects Versions: 1.9.0
Reporter: Shaoxuan Wang
Assignee: Luo Gen


As discussed in dev-ml, we decided to delete the legacy 
flink-libraries/flink-ml, so as to the flink-libraries/flink-ml-uber. There is 
not any further development planned for this legacy flink-ml package in 1.9 or 
even future. Users could just use the 1.8 version if their products/projects 
still rely on this package.

[1] 
[http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-12473) Add the interface of ML pipeline and ML lib

2019-05-10 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-12473:
-

Assignee: Luo Gen

> Add the interface of ML pipeline and ML lib
> ---
>
> Key: FLINK-12473
> URL: https://issues.apache.org/jira/browse/FLINK-12473
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Shaoxuan Wang
>Assignee: Luo Gen
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-05-10-12-50-15-869.png
>
>   Original Estimate: 168h
>  Time Spent: 10m
>  Remaining Estimate: 167h 50m
>
> This Jira will introduce the major interfaces for ML pipeline and ML lib.
> The major interfaces and their relationship diagram are shown below. For 
> more details, please refer to the [FLIP39 design 
> doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo]
>  
> !image-2019-05-10-12-50-15-869.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12473) Add the interface of ML pipeline and ML lib

2019-05-10 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-12473:
--
Summary: Add the interface of ML pipeline and ML lib  (was: Add ML pipeline 
and ML lib Core API)

> Add the interface of ML pipeline and ML lib
> ---
>
> Key: FLINK-12473
> URL: https://issues.apache.org/jira/browse/FLINK-12473
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Shaoxuan Wang
>Priority: Major
> Attachments: image-2019-05-10-12-50-15-869.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This Jira will introduce the major interfaces for ML pipeline and ML lib.
> The major interfaces and their relationship diagram are shown below. For 
> more details, please refer to the [FLIP39 design 
> doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo]
>  
> !image-2019-05-10-12-50-15-869.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12473) Add ML pipeline and ML lib Core API

2019-05-09 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-12473:
-

 Summary: Add ML pipeline and ML lib Core API
 Key: FLINK-12473
 URL: https://issues.apache.org/jira/browse/FLINK-12473
 Project: Flink
  Issue Type: Sub-task
  Components: Library / Machine Learning
Reporter: Shaoxuan Wang
 Attachments: image-2019-05-10-12-50-15-869.png

This Jira will introduce the major interfaces for ML pipeline and ML lib.

The major interfaces and their relationship diagram are shown below. For more 
details, please refer to the [FLIP39 design 
doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo]

 

!image-2019-05-10-12-50-15-869.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11115) Port some flink.ml algorithms to table based

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11115.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Port some flink.ml algorithms to table based
> 
>
> Key: FLINK-11115
> URL: https://issues.apache.org/jira/browse/FLINK-11115
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> This sub-task is to port some flink.ml algorithms to be table based, in order 
> to verify the correctness of the design and implementation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11114) Support wrapping inference pipeline as a UDF function in SQL

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11114.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Support wrapping inference pipeline as a UDF function in SQL
> 
>
> Key: FLINK-11114
> URL: https://issues.apache.org/jira/browse/FLINK-11114
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> Though the standard ML pipeline inference usage is table based (that is, users 
> at the client construct the DAG using only the Table API), it is also desirable 
> to wrap the inference logic as a UDF to be used in SQL. This means it shall be 
> record based. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11113) Support periodically update models when inferencing

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11113.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Support periodically update models when inferencing
> ---
>
> Key: FLINK-11113
> URL: https://issues.apache.org/jira/browse/FLINK-11113
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> As models will be periodically updated, and the inference job may be running on 
> a stream and will NOT finish, it is important to have this inference job 
> periodically reload the latest model for inference without stopping and 
> restarting the inference job. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11111) Create a new set of parameters

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11111.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Create a new set of parameters
> --
>
> Key: FLINK-11111
> URL: https://issues.apache.org/jira/browse/FLINK-11111
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> One goal of the new Table-based ML Pipeline is to be easy to build tooling on. 
> That is, any ML/AI algorithm that adopts this ML Pipeline standard shall declare 
> all its parameters via a well-defined interface, so that an AI platform can 
> uniformly get/set the corresponding parameters while remaining agnostic to the 
> details of the specific algorithm. The only difference between algorithms, from 
> a user's perspective, is the name. All the other algorithm parameters are 
> self-descriptive. 
>  
> This will also be useful for future Flink ML SQL, as the SQL parser can 
> uniformly handle all these parameters. This can greatly simplify the SQL 
> parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11112) Support pipeline import/export

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11112.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Support pipeline import/export
> --
>
> Key: FLINK-11112
> URL: https://issues.apache.org/jira/browse/FLINK-11112
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> It is quite common to have one job for training that periodically exports the 
> trained pipeline and models, and another job that loads the exported pipeline 
> for inference. 
>  
> Thus, we will need functionality for pipeline import/export. This shall work in 
> both streaming and batch environments. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11109) Create Table based optimizers

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11109.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Create Table based optimizers
> -
>
> Key: FLINK-11109
> URL: https://issues.apache.org/jira/browse/FLINK-11109
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> The existing optimizers in org.apache.flink.ml package are dataset based. 
> This task is to create a new set of optimizers which are table based. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11108) Create a new set of table based ML Pipeline classes

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11108.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Create a new set of table based ML Pipeline classes
> ---
>
> Key: FLINK-11108
> URL: https://issues.apache.org/jira/browse/FLINK-11108
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> The main classes are:
>  # PipelineStage (the trait for each pipeline stage)
>  # Estimator (training stage)
>  # Transformer (the inference/feature engineering stage)
>  # Pipeline (the whole pipeline)
>  # Predictor (extends Estimator, for supervised learning)
> The detailed design can be found in the parent JIRA's design document.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11096) Create a new table based flink ML package

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11096.
-
Resolution: Duplicate

We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira (FLINK-12470) of FLIP39 

> Create a new table based flink ML package
> -
>
> Key: FLINK-11096
> URL: https://issues.apache.org/jira/browse/FLINK-11096
> Project: Flink
>  Issue Type: Sub-task
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> Currently, the DataSet based ML library is under the org.apache.flink.ml Scala 
> package and the flink-libraries/flink-ml directory.
>  
> There are two questions related to packaging:
>  # Shall we create a new Scala/Java package, e.g. org.apache.flink.table.ml? 
> Or still stay in org.apache.flink.ml?
>  # Shall we still put new code in the flink-libraries/flink-ml directory, or 
> create a new one, e.g. flink-libraries/flink-table-ml, with a corresponding 
> Maven module?
>  
> I implemented a prototype for the design and found that the new design is 
> very hard to fit into the existing flink.ml codebase. The existing flink.ml 
> code is tightly coupled with the DataSet API. Thus, I had to rewrite almost all 
> parts of flink.ml to get a sample case to work. The only reusable code from 
> flink.ml is the base math classes under the org.apache.flink.ml.math and 
> org.apache.flink.ml.metrics.distance packages. 
> Considering this fact, I would prefer to create a new package 
> org.apache.flink.table.ml and a new Maven module flink-table-ml.
>  
> Please feel free to give your feedback. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-11095) Table based ML Pipeline

2019-05-09 Thread Shaoxuan Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836837#comment-16836837
 ] 

Shaoxuan Wang edited comment on FLINK-11095 at 5/10/19 2:48 AM:


We decided to move the entire ml pipeline and ml lib development under the 
umbrella Jira ([FLINK-12470|https://issues.apache.org/jira/browse/FLINK-12470]) 
of FLIP39 


was (Author: shaoxuanwang):
We decided to move the entire ml pipeline and ml lib development under FLIP39

> Table based ML Pipeline
> ---
>
> Key: FLINK-11095
> URL: https://issues.apache.org/jira/browse/FLINK-11095
> Project: Flink
>  Issue Type: New Feature
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> As the Table API will be the unified high-level API for both batch and 
> streaming, it is desirable to have a Table API based ML Pipeline definition. 
> This new Table-based ML Pipeline shall:
>  # support unified batch/stream ML/AI functionalities (training/inference).  
>  # be seamlessly integrated with the Flink Table based ecosystem. 
>  # provide a base for further Flink based AI platform/tooling support.
>  # provide a base for further Flink ML SQL integration. 
> The initial design is here: 
> [https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing]
> Based on this design, I made an initial implementation/prototype. I will share 
> the code later.
> This is the umbrella JIRA. I will create a corresponding sub-JIRA for each 
> sub-task. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11095) Table based ML Pipeline

2019-05-09 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11095.
-
Resolution: Fixed

We decided to move the entire ml pipeline and ml lib development under FLIP39

> Table based ML Pipeline
> ---
>
> Key: FLINK-11095
> URL: https://issues.apache.org/jira/browse/FLINK-11095
> Project: Flink
>  Issue Type: New Feature
>  Components: Library / Machine Learning
>Reporter: Weihua Jiang
>Priority: Major
>
> As the Table API will be the unified high-level API for both batch and 
> streaming, it is desirable to have a Table API based ML Pipeline definition. 
> This new Table-based ML Pipeline shall:
>  # support unified batch/stream ML/AI functionalities (training/inference).  
>  # be seamlessly integrated with the Flink Table based ecosystem. 
>  # provide a base for further Flink based AI platform/tooling support.
>  # provide a base for further Flink ML SQL integration. 
> The initial design is here: 
> [https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing]
> Based on this design, I made an initial implementation/prototype. I will share 
> the code later.
> This is the umbrella JIRA. I will create a corresponding sub-JIRA for each 
> sub-task. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12470) FLIP39: Flink ML pipeline and ML libs

2019-05-09 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-12470:
-

 Summary: FLIP39: Flink ML pipeline and ML libs
 Key: FLINK-12470
 URL: https://issues.apache.org/jira/browse/FLINK-12470
 Project: Flink
  Issue Type: New Feature
  Components: Library / Machine Learning
Affects Versions: 1.9.0
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang
 Fix For: 1.9.0


This is the umbrella Jira for FLIP39, which intends to enhance the 
scalability and the ease of use of Flink ML. 

ML Discussion thread: 
[http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html]

Google Doc: (will convert it to an official confluence page very soon) 
[https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit]

In machine learning, there are mainly two types of people. The first type is 
MLlib developers. They need a set of standard, well-abstracted core ML APIs to 
implement the algorithms. Every ML algorithm is a concrete implementation on top 
of these APIs. The second type is MLlib users, who utilize the existing/packaged 
MLlib to train or serve a model. It is pretty common that the entire training or 
inference is constructed from a sequence of transformations or algorithms. It is 
essential to provide a workflow/pipeline API for MLlib users such that they can 
easily combine multiple algorithms to describe the ML workflow/pipeline.

Flink currently has a set of ML core interfaces, but they are built on top of 
the DataSet API. This does not quite align with the latest Flink 
[roadmap|https://flink.apache.org/roadmap.html] (the Table API will become the 
first-class citizen and primary API for analytics use cases, while the DataSet 
API will be gradually deprecated). Moreover, Flink at present does not have any 
interface that allows MLlib users to describe an ML workflow/pipeline, nor does 
it provide any approach to persist pipelines or models and reuse them in the 
future. To solve/improve these issues, in this FLIP we propose to:
 * Provide a new set of ML core interfaces (on top of the Flink Table API)
 * Provide an ML pipeline interface (on top of the Flink Table API)
 * Provide the interfaces for parameter management and pipeline persistence
 * All the above interfaces should facilitate any new ML algorithm. We will 
gradually add various standard ML algorithms on top of these newly proposed 
interfaces to ensure their feasibility and scalability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11004) Wrong ProcessWindowFunction.process argument in example of Incremental Window Aggregation with ReduceFunction

2018-11-28 Thread Shaoxuan Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-11004.
-
Resolution: Fixed

Fixed and merged into master

> Wrong ProcessWindowFunction.process argument in example of Incremental Window 
> Aggregation with ReduceFunction
> -
>
> Key: FLINK-11004
> URL: https://issues.apache.org/jira/browse/FLINK-11004
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.2, 1.5.5, 1.6.2
>Reporter: Yuanyang Wu
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> [https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/windows.html#incremental-window-aggregation-with-reducefunction]
> The example uses the wrong "window" argument in process().
> Java example:
> {{- out.collect(new Tuple2<Long, SensorReading>(window.getStart(), min));}}
> {{+ out.collect(new Tuple2<Long, SensorReading>(context.window().getStart(), min));}}
>  
> In the Scala example, the 2nd argument should be context: Context instead of 
> window. (A corrected sketch follows below.)
>  
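
For reference, this is roughly how the corrected Java example from the linked 
documentation page looks. SensorReading is the example POJO used on that page; the 
sketch assumes the standard ProcessWindowFunction signature and only illustrates 
the context.window() fix.

{code:java}
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class MyProcessWindowFunction
        extends ProcessWindowFunction<SensorReading, Tuple2<Long, SensorReading>, String, TimeWindow> {

    @Override
    public void process(String key,
                        Context context,
                        Iterable<SensorReading> minReadings,
                        Collector<Tuple2<Long, SensorReading>> out) {
        SensorReading min = minReadings.iterator().next();
        // The fix: the window is obtained via context.window(), not a "window" parameter.
        out.collect(new Tuple2<>(context.window().getStart(), min));
    }
}
{code}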



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-5905) Add user-defined aggregation functions to documentation.

2017-08-15 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127439#comment-16127439
 ] 

Shaoxuan Wang commented on FLINK-5905:
--

This issue will be solved once FLINK-6751 is merged

> Add user-defined aggregation functions to documentation.
> 
>
> Key: FLINK-5905
> URL: https://issues.apache.org/jira/browse/FLINK-5905
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Fabian Hueske
>Assignee: Shaoxuan Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (FLINK-7194) Add getResultType and getAccumulatorType to AggregateFunction

2017-07-19 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092983#comment-16092983
 ] 

Shaoxuan Wang edited comment on FLINK-7194 at 7/19/17 12:15 PM:


[~fhueske], I am ok with your proposal for the changes to the 
{{AggregateFunction}}


was (Author: shaoxuanwang):
[~fhueske], I am ok with your proposal for the changes to the AggregateFunction

> Add getResultType and getAccumulatorType to AggregateFunction
> -
>
> Key: FLINK-7194
> URL: https://issues.apache.org/jira/browse/FLINK-7194
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Affects Versions: 1.4.0
>Reporter: Fabian Hueske
>
> FLINK-6725 and FLINK-6457 proposed to remove methods with default 
> implementations such as {{getResultType()}}, {{toString()}}, or 
> {{requiresOver()}} from the base classes of user-defined functions (UDF, UDTF, 
> UDAGG) and instead offer them as contract methods which are dynamically 
> resolved.
> In PR [#3993|https://github.com/apache/flink/pull/3993] I argued that these 
> methods have a fixed signature (in contrast to the {{eval()}}, 
> {{accumulate()}} and {{retract()}} methods) and should be kept in the 
> classes. For users that don't need these methods, this doesn't make a 
> difference because the methods are not abstract and have a default 
> implementation. For users that need to override the methods it makes a 
> difference, because they get IDE and compiler support when overriding them 
> and they cannot get the signature wrong.
> Consequently, I propose to add {{getResultType()}} and 
> {{getAccumulatorType()}} as methods with default implementations to 
> {{AggregateFunction}}. This will make the interface of {{AggregateFunction}} 
> more consistent with {{ScalarFunction}} and {{TableFunction}}. (A sketch 
> follows below.)
> What do you think [~shaoxuan], [~RuidongLi] and [~jark]?
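
A hedged sketch of what the proposal enables once getResultType() and 
getAccumulatorType() exist on AggregateFunction: a UDAGG can override the two 
fixed-signature type hints directly, with IDE and compiler support. LongSum and 
its accumulator are illustrative names, not code from this issue.

{code:java}
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.table.functions.AggregateFunction;

public class LongSum extends AggregateFunction<Long, LongSum.SumAcc> {

    public static class SumAcc {
        public long sum = 0L;
    }

    @Override
    public SumAcc createAccumulator() {
        return new SumAcc();
    }

    // accumulate() stays a contract method: fixed name, flexible argument types.
    public void accumulate(SumAcc acc, Long value) {
        if (value != null) {
            acc.sum += value;
        }
    }

    @Override
    public Long getValue(SumAcc acc) {
        return acc.sum;
    }

    // The two methods proposed above: fixed signatures, default implementations
    // in the base class, overridden here to supply explicit type information.
    @Override
    public TypeInformation<Long> getResultType() {
        return TypeInformation.of(Long.class);
    }

    @Override
    public TypeInformation<SumAcc> getAccumulatorType() {
        return TypeInformation.of(SumAcc.class);
    }
}
{code}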



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7194) Add getResultType and getAccumulatorType to AggregateFunction

2017-07-19 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092983#comment-16092983
 ] 

Shaoxuan Wang commented on FLINK-7194:
--

[~fhueske], I am ok with your proposal for the changes to the AggregateFunction

> Add getResultType and getAccumulatorType to AggregateFunction
> -
>
> Key: FLINK-7194
> URL: https://issues.apache.org/jira/browse/FLINK-7194
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Affects Versions: 1.4.0
>Reporter: Fabian Hueske
>
> FLINK-6725 and FLINK-6457 proposed to remove methods with default 
> implementations such as {{getResultType()}}, {{toString()}}, or 
> {{requiresOver()}} from the base classes of user-defined functions (UDF, UDTF, 
> UDAGG) and instead offer them as contract methods which are dynamically 
> resolved.
> In PR [#3993|https://github.com/apache/flink/pull/3993] I argued that these 
> methods have a fixed signature (in contrast to the {{eval()}}, 
> {{accumulate()}} and {{retract()}} methods) and should be kept in the 
> classes. For users that don't need these methods, this doesn't make a 
> difference because the methods are not abstract and have a default 
> implementation. For users that need to override the methods it makes a 
> difference, because they get IDE and compiler support when overriding them 
> and they cannot get the signature wrong.
> Consequently, I propose to add {{getResultType()}} and 
> {{getAccumulatorType()}} as methods with default implementations to 
> {{AggregateFunction}}. This will make the interface of {{AggregateFunction}} 
> more consistent with {{ScalarFunction}} and {{TableFunction}}.
> What do you think [~shaoxuan], [~RuidongLi] and [~jark]?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-6751) Table API / SQL Docs: UDFs Page

2017-07-19 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092970#comment-16092970
 ] 

Shaoxuan Wang commented on FLINK-6751:
--

Sorry for the lack of updates. I will work on this in the next 3-4 days.

> Table API / SQL Docs: UDFs Page
> ---
>
> Key: FLINK-6751
> URL: https://issues.apache.org/jira/browse/FLINK-6751
> Project: Flink
>  Issue Type: Task
>  Components: Documentation, Table API & SQL
>Affects Versions: 1.3.0
>Reporter: Fabian Hueske
>Assignee: Shaoxuan Wang
>
> Update and extend the documentation of UDFs in the Table API / SQL: 
> {{./docs/dev/table/udfs.md}}
> Missing sections:
> - Registration of UDFs
> - UDAGGs



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (FLINK-6725) make requiresOver as a contracted method in udagg

2017-07-17 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-6725.

Resolution: Won't Fix

> make requiresOver as a contracted method in udagg
> -
>
> Key: FLINK-6725
> URL: https://issues.apache.org/jira/browse/FLINK-6725
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> I realized requiresOver is defined in the udagg interface when I wrote up the 
> udagg doc. I would like to make requiresOver a contract method. This makes 
> the entire udagg interface consistent and clean.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (FLINK-5878) Add stream-stream inner/left-out join

2017-07-17 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5878:
-
Summary: Add stream-stream inner/left-out join  (was: Add stream-stream 
inner join on TableAPI)

> Add stream-stream inner/left-out join
> -
>
> Key: FLINK-5878
> URL: https://issues.apache.org/jira/browse/FLINK-5878
> Project: Flink
>  Issue Type: New Feature
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> This task is intended to support stream-stream inner join on the Table API. A 
> brief design doc has been created: 
> https://docs.google.com/document/d/10oJDw-P9fImD5tc3Stwr7aGIhvem4l8uB2bjLzK72u4/edit
> We propose to use the mapState as the backend state interface for this "join" 
> operator, so this task requires FLINK-4856.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (FLINK-5878) Add stream-stream inner/left-out join

2017-07-17 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-5878:


Assignee: Hequn Cheng  (was: Shaoxuan Wang)

> Add stream-stream inner/left-out join
> -
>
> Key: FLINK-5878
> URL: https://issues.apache.org/jira/browse/FLINK-5878
> Project: Flink
>  Issue Type: New Feature
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Hequn Cheng
>
> This task is intended to support stream-stream inner join on the Table API. A 
> brief design doc has been created: 
> https://docs.google.com/document/d/10oJDw-P9fImD5tc3Stwr7aGIhvem4l8uB2bjLzK72u4/edit
> We propose to use the mapState as the backend state interface for this "join" 
> operator, so this task requires FLINK-4856.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-7146) FLINK SQLs support DDL

2017-07-12 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083614#comment-16083614
 ] 

Shaoxuan Wang commented on FLINK-7146:
--

[~yuemeng], DDL is an important interface (it is a customer-facing interface; once 
we release it, it is not easy to revise) which we have been working on and 
polishing for a couple of months. We are hesitant to propose something while 
several design details are still under debate. We have opened FLINK-6962 for a 
similar purpose and will post a detailed design doc very soon. Thanks.

> FLINK SQLs support DDL
> --
>
> Key: FLINK-7146
> URL: https://issues.apache.org/jira/browse/FLINK-7146
> Project: Flink
>  Issue Type: New Feature
>  Components: Table API & SQL
>Reporter: yuemeng
>
> For now, Flink SQL doesn't support DDL; we can only register a table by calling 
> registerTableInternal in TableEnvironment.
> We should support DDL in SQL, such as creating a table or creating a function, 
> like:
> {code}
> CREATE TABLE kafka_source (
>   id INT,
>   price INT
> ) PROPERTIES (
>   category = 'source',
>   type = 'kafka',
>   version = '0.9.0.1',
>   separator = ',',
>   topic = 'test',
>   brokers = 'xx:9092',
>   group_id = 'test'
> );
> CREATE TABLE db_sink (
>   id INT,
>   price DOUBLE
> ) PROPERTIES (
>   category = 'sink',
>   type = 'mysql',
>   table_name = 'udaf_test',
>   url = 
> 'jdbc:mysql://127.0.0.1:3308/ds?useUnicode=true=UTF8',
>   username = 'ds_dev',
>   password = 's]k51_(>R'
> );
> CREATE TEMPORARY function 'AVGUDAF' AS 
> 'com.x.server.codegen.aggregate.udaf.avg.IntegerAvgUDAF';
> INSERT INTO db_sink SELECT id ,AVGUDAF(price) FROM kafka_source group by id
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-6962) SQL DDL for input and output tables

2017-06-21 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6962:


 Summary: SQL DDL for input and output tables
 Key: FLINK-6962
 URL: https://issues.apache.org/jira/browse/FLINK-6962
 Project: Flink
  Issue Type: New Feature
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: lincoln.lee
 Fix For: 1.4.0


This Jira adds support to allow users to define the DDL for source and sink 
tables, including the watermark (on the source table) and the emit SLA (on the 
result table). The detailed design doc will be attached soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-6963) User Defined Operator

2017-06-21 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6963:


 Summary: User Defined Operator
 Key: FLINK-6963
 URL: https://issues.apache.org/jira/browse/FLINK-6963
 Project: Flink
  Issue Type: New Feature
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Jark Wu
 Fix For: 1.4.0


We are proposing to add a User Defined Operator (UDOP) interface. As opposed to 
UDF (scalars to scalar), UDTF (scalar to table), and UDAGG (table to scalar), a 
UDOP allows users to describe business logic for a table-to-table conversion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-6961) Enable configurable early-firing rate

2017-06-21 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6961:


 Summary: Enable configurable early-firing rate 
 Key: FLINK-6961
 URL: https://issues.apache.org/jira/browse/FLINK-6961
 Project: Flink
  Issue Type: New Feature
  Components: Table API & SQL
Reporter: Shaoxuan Wang


There are cases where we need to emit the result earlier to allow users to get an 
observation/sample of the result. Right now, only the unbounded aggregate works 
in early-firing mode (in the future we will support early firing for all 
different scenarios, like windowed aggregate, unbounded/windowed join, etc.). 
But in the unbounded aggregate, the result is prepared and emitted for each 
input. This may not be necessary, as users may not need to get the result so 
frequently in most cases.
We create this Jira to track all the efforts (sub-JIRAs) to enable a configurable 
early-firing rate. It should be noted that the early-firing rate will not be 
exposed to the user; it will be decided by the query optimizer depending on the 
SLA (allowed latency) of the final result.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-5354) Split up Table API documentation into multiple pages

2017-05-28 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027842#comment-16027842
 ] 

Shaoxuan Wang commented on FLINK-5354:
--

[~fhueske], thanks for kicking off this. I would like to work on the page for 
UDFs, as it seems the major missing part is UDAGG. 

> Split up Table API documentation into multiple pages 
> -
>
> Key: FLINK-5354
> URL: https://issues.apache.org/jira/browse/FLINK-5354
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Table API & SQL
>Reporter: Timo Walther
>
> The Table API documentation page is quite large at the moment. We should 
> split it up into multiple pages:
> Here is my suggestion:
> - Overview (Datatypes, Config, Registering Tables, Examples)
> - TableSources and Sinks
> - Table API
> - SQL
> - Functions



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-6751) Table API / SQL Docs: UDFs Page

2017-05-28 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-6751:


Assignee: Shaoxuan Wang

> Table API / SQL Docs: UDFs Page
> ---
>
> Key: FLINK-6751
> URL: https://issues.apache.org/jira/browse/FLINK-6751
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation, Table API & SQL
>Affects Versions: 1.3.0
>Reporter: Fabian Hueske
>Assignee: Shaoxuan Wang
>
> Update and refine {{./docs/dev/table/udfs.md}} in feature branch 
> https://github.com/apache/flink/tree/tableDocs



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6725) make requiresOver as a contracted method in udagg

2017-05-25 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6725:


 Summary: make requiresOver as a contracted method in udagg
 Key: FLINK-6725
 URL: https://issues.apache.org/jira/browse/FLINK-6725
 Project: Flink
  Issue Type: Improvement
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


I realized requiresOver is defined in the udagg interface when I wrote up the 
udagg doc. I would like to make requiresOver a contract method. This makes 
the entire udagg interface consistent and clean.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-6637) Move registerFunction to TableEnvironment

2017-05-23 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-6637:
-
Component/s: Table API & SQL

> Move registerFunction to TableEnvironment
> -
>
> Key: FLINK-6637
> URL: https://issues.apache.org/jira/browse/FLINK-6637
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> We are trying to unify stream and batch. This unification should cover the 
> Table API query as well as the function registration (as part of DDL). 
> Currently, registerFunction for UDTF and UDAGG is defined in 
> BatchTableEnvironment and StreamTableEnvironment separately. We should move 
> registerFunction to TableEnvironment.
> The reason we did not put registerFunction into TableEnvironment for UDTF and 
> UDAGG is that we need different registerFunction variants for Java and Scala 
> code, as Java needs a special way to generate and pass the implicit value of 
> typeInfo:
> {code:scala}
> implicit val typeInfo: TypeInformation[T] = TypeExtractor
>   .createTypeInfo(tf, classOf[TableFunction[_]], tf.getClass, 0)
>   .asInstanceOf[TypeInformation[T]]
> {code}
> It seems that we would need to duplicate the TableEnvironment class, one for 
> Java and one for Scala.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6637) Move registerFunction to TableEnvironment

2017-05-19 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6637:


 Summary: Move registerFunction to TableEnvironment
 Key: FLINK-6637
 URL: https://issues.apache.org/jira/browse/FLINK-6637
 Project: Flink
  Issue Type: Improvement
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


We are trying to unify stream and batch. This unification should cover the 
Table API query as well as the function registration (as part of DDL). 
Currently, registerFunction for UDTF and UDAGG is defined in 
BatchTableEnvironment and StreamTableEnvironment separately. We should move 
registerFunction to TableEnvironment.

The reason we did not put registerFunction into TableEnvironment for UDTF 
and UDAGG is that we need different registerFunction variants for Java and Scala 
code, as Java needs a special way to generate and pass the implicit value of 
typeInfo:
{code:scala}
implicit val typeInfo: TypeInformation[T] = TypeExtractor
  .createTypeInfo(tf, classOf[TableFunction[_]], tf.getClass, 0)
  .asInstanceOf[TypeInformation[T]]
{code}

It seems that we would need to duplicate the TableEnvironment class, one for 
Java and one for Scala.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-6544) Expose State Backend Interface for UDAGG

2017-05-12 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-6544:
-
Summary: Expose State Backend Interface for UDAGG  (was: Expose Backend 
State Interface for UDAGG)

> Expose State Backend Interface for UDAGG
> 
>
> Key: FLINK-6544
> URL: https://issues.apache.org/jira/browse/FLINK-6544
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Kaibo Zhou
>Assignee: Kaibo Zhou
>
> Currently UDAGG users cannot access state; it's necessary to provide users 
> with a convenient and efficient way to access state within a UDAGG.
> This is the design doc: 
> https://docs.google.com/document/d/1g-wHOuFj5pMaYMJg90kVIO2IHbiQiO26nWscLIOn50c/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-6544) Expose Backend State Interface for UDAGG

2017-05-12 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-6544:


Assignee: Kaibo Zhou

> Expose Backend State Interface for UDAGG
> 
>
> Key: FLINK-6544
> URL: https://issues.apache.org/jira/browse/FLINK-6544
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Kaibo Zhou
>Assignee: Kaibo Zhou
>
> Currently UDAGG users cannot access state; it is necessary to provide users 
> with a convenient and efficient way to access the state within the UDAGG.
> This is the design doc: 
> https://docs.google.com/document/d/1g-wHOuFj5pMaYMJg90kVIO2IHbiQiO26nWscLIOn50c/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (FLINK-5433) initiate function of Aggregate does not take effect for DataStream aggregation

2017-05-04 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang closed FLINK-5433.

Resolution: Fixed

Fixed by UDAGG subtasks FLINK-5564

> initiate function of Aggregate does not take effect for DataStream aggregation
> --
>
> Key: FLINK-5433
> URL: https://issues.apache.org/jira/browse/FLINK-5433
> Project: Flink
>  Issue Type: Bug
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> The initiate function of Aggregate works for DataSet aggregation, but does 
> not work for DataStream aggregation.
> For instance, when an initial value, say 2, is given to CountAggregate, the 
> result of the DataSet aggregate will take this change into account, but the 
> DataStream aggregate will not.
> {code}
> class CountAggregate extends Aggregate[Long] {
>   override def initiate(intermediate: Row): Unit = {
> intermediate.setField(countIndex, 2L)
>   }
> }
> {code}
> The output of the DataSet test (testWorkingAggregationDataTypes) with
> .select('_1.avg, '_2.avg, '_3.avg, '_4.avg, '_5.avg, '_6.avg, '_7.count)
> will be:
>  expected: [1,1,1,1,1.5,1.5,2]
>  received: [1,1,1,1,1.5,1.5,4] (the result of the last count aggregate is bigger 
> than the expected value by 2, as expected)
> But the output of the DataStream 
> test (testProcessingTimeSlidingGroupWindowOverCount) will remain the same:
> .select('string, 'int.count, 'int.avg)
> Expected :List(Hello world,1,3, Hello world,2,3, Hello,1,2, Hello,2,2, Hi,1,1)
> Actual   :MutableList(Hello world,1,3, Hello world,2,3, Hello,1,2, Hello,2,2, 
> Hi,1,1)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (FLINK-5915) Add support for the aggregate on multi fields

2017-05-04 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang resolved FLINK-5915.
--
Resolution: Fixed

This is completely resolved with FLINK-5906.

> Add support for the aggregate on multi fields
> -
>
> Key: FLINK-5915
> URL: https://issues.apache.org/jira/browse/FLINK-5915
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
> Fix For: 1.3.0
>
>
> Some UDAGGs have multiple fields as input. For instance:
> {code}
> table
>   .window(Tumble over 10.minutes on 'rowtime as 'w)
>   .groupBy('key, 'w)
>   .select('key, weightedAvg('value, 'weight))
> {code}
> This task will add support for aggregates on multiple fields.
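
For context, a minimal sketch (illustrative only, not part of this issue's patch) 
of what such a multi-field UDAGG could look like, assuming an AggregateFunction-style 
interface where accumulate is a contracted method; the class and field names are 
made up for the example:
{code}
// Illustrative sketch of a weighted average that consumes two fields per row.
import org.apache.flink.table.functions.AggregateFunction

class WeightedAvgAccumulator {
  var sum: Long = 0L
  var count: Long = 0L
}

class WeightedAvg extends AggregateFunction[Long, WeightedAvgAccumulator] {

  override def createAccumulator(): WeightedAvgAccumulator =
    new WeightedAvgAccumulator

  // Contracted method: receives all aggregate arguments of the current row.
  def accumulate(acc: WeightedAvgAccumulator, value: Long, weight: Long): Unit = {
    acc.sum += value * weight
    acc.count += weight
  }

  override def getValue(acc: WeightedAvgAccumulator): Long =
    if (acc.count == 0) 0L else acc.sum / acc.count
}
{code}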



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-5905) Add user-defined aggregation functions to documentation.

2017-05-02 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-5905:


Assignee: Shaoxuan Wang

> Add user-defined aggregation functions to documentation.
> 
>
> Key: FLINK-5905
> URL: https://issues.apache.org/jira/browse/FLINK-5905
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Fabian Hueske
>Assignee: Shaoxuan Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-5906) Add support to register UDAGG in Table and SQL API

2017-05-02 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5906:
-
Summary: Add support to register UDAGG in Table and SQL API  (was: Add 
support to register UDAGGs in TableEnvironment)

> Add support to register UDAGG in Table and SQL API
> --
>
> Key: FLINK-5906
> URL: https://issues.apache.org/jira/browse/FLINK-5906
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Fabian Hueske
>Assignee: Shaoxuan Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6361) Finalize the AggregateFunction interface and refactoring built-in aggregates

2017-04-22 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6361:


 Summary: Finalize the AggregateFunction interface and refactoring 
built-in aggregates
 Key: FLINK-6361
 URL: https://issues.apache.org/jira/browse/FLINK-6361
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


We have completed codeGen for all aggregate runtime functions. Now we can 
finalize the AggregateFunction interface. This includes: 1) removing the 
Accumulator trait; 2) removing the accumulate, retract, merge, resetAccumulator, 
and getAccumulatorType methods from the interface and allowing them as contracted 
methods for UDAGGs; 3) refactoring the built-in aggregates accordingly.
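
For orientation, a rough sketch of the intended interface surface after this 
refactoring; this is illustrative only (hence the Sketch suffix), not the final code:
{code}
// Illustrative sketch of the slimmed-down interface described above:
// only createAccumulator and getValue remain declared, while accumulate /
// retract / merge / resetAccumulator become contracted methods that the
// code generator looks up on the concrete UDAGG class by name.
abstract class AggregateFunctionSketch[T, ACC] extends Serializable {

  // Creates and initializes the accumulator for one aggregation group.
  def createAccumulator(): ACC

  // Computes the final aggregate result from the accumulator.
  def getValue(accumulator: ACC): T
}
{code}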



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (FLINK-5813) code generation for user-defined aggregate functions

2017-04-22 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang resolved FLINK-5813.
--
Resolution: Fixed

All required issues are solved.

> code generation for user-defined aggregate functions
> 
>
> Key: FLINK-5813
> URL: https://issues.apache.org/jira/browse/FLINK-5813
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> The input and return types of the new proposed UDAGG functions are 
> dynamically given by the users. All these user defined functions have to be 
> generated via codegen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-6334) Refactoring UDTF interface

2017-04-19 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-6334:
-
Description: The current UDTF leverages the table.join(expression) 
interface, which is not a proper interface in terms of semantics. We would like 
to refactor this to let the UDTF use the table.join(table) interface. Very briefly, 
the UDTF's apply method will return a Table type, so Join(UDTF('a, 'b, ...) as 'c) 
shall be viewed as join(Table).  (was: UDTF's apply method returns a Table Type, 
so Join(UDTF('a, 'b, ...) as 'c) shall be viewed as join(Table))

> Refactoring UDTF interface
> --
>
> Key: FLINK-6334
> URL: https://issues.apache.org/jira/browse/FLINK-6334
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Ruidong Li
>
> The current UDTF leverages the table.join(expression) interface, which is not 
> a proper interface in terms of semantics. We would like to refactor this to 
> let the UDTF use the table.join(table) interface. Very briefly, the UDTF's apply 
> method will return a Table type, so Join(UDTF('a, 'b, ...) as 'c) shall be 
> viewed as join(Table).
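
As an illustration of the target style, a small sketch assuming a split-like table 
function; inputTable is an assumed existing Table with a 'line field, and all names 
are examples rather than the final API:
{code}
// Illustrative only: a UDTF whose application is joined as a Table.
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableFunction

class Split extends TableFunction[String] {
  // Emits one row per whitespace-separated token of the input line.
  def eval(line: String): Unit = line.split(" ").foreach(collect)
}

val split = new Split

// join(UDTF(...) as 'word) is treated as a join against a Table,
// rather than as a join(expression).
val result = inputTable
  .join(split('line) as 'word)
  .select('line, 'word)
{code}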



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5564) User Defined Aggregates

2017-04-01 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15952127#comment-15952127
 ] 

Shaoxuan Wang commented on FLINK-5564:
--

[~fhueske], implementing the entire codeGen in one PR results in thousands of 
lines of code changes; it is hard to review. I split FLINK-5813 into three 
sub-tasks.

> User Defined Aggregates
> ---
>
> Key: FLINK-5564
> URL: https://issues.apache.org/jira/browse/FLINK-5564
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> User-defined aggregates would be a great addition to the Table API / SQL.
> The current aggregate interface is not well suited for the external users.  
> This issue proposes to redesign the aggregate such that we can expose a 
> better external UDAGG interface to the users. The detailed design proposal 
> can be found here: 
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit
> Motivation:
> 1. The current aggregate interface is not very concise to the users. One 
> needs to know the design details of the intermediate Row buffer before 
> implementing an Aggregate. Seven functions are needed even for a simple Count 
> aggregate.
> 2. Another limitation of current aggregate function is that it can only be 
> applied on one single column. There are many scenarios which require the 
> aggregate function taking multiple columns as the inputs.
> 3. “Retraction” is not considered and covered in the current Aggregate.
> 4. It might be very good to have a local/global aggregate query plan 
> optimization, which is very promising to optimize UDAGG performance in some 
> scenarios.
> Proposed Changes:
> 1. Implement an aggregate dataStream API (Done by 
> [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582])
> 2. Update all the existing aggregates to use the new aggregate dataStream API
> 3. Provide a better User-Defined Aggregate interface
> 4. Add retraction support
> 5. Add local/global aggregate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6242) codeGen DataSet Goupingwindow Aggregates

2017-04-01 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6242:


 Summary: codeGen DataSet Goupingwindow Aggregates
 Key: FLINK-6242
 URL: https://issues.apache.org/jira/browse/FLINK-6242
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6241) codeGen dataStream aggregates that use ProcessFunction

2017-04-01 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6241:


 Summary: codeGen dataStream aggregates that use ProcessFunction
 Key: FLINK-6241
 URL: https://issues.apache.org/jira/browse/FLINK-6241
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6240) codeGen dataStream aggregates that use AggregateAggFunction

2017-04-01 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6240:


 Summary: codeGen dataStream aggregates that use 
AggregateAggFunction
 Key: FLINK-6240
 URL: https://issues.apache.org/jira/browse/FLINK-6240
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6216) DataStream unbounded groupby aggregate with early firing

2017-03-29 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6216:


 Summary: DataStream unbounded groupby aggregate with early firing
 Key: FLINK-6216
 URL: https://issues.apache.org/jira/browse/FLINK-6216
 Project: Flink
  Issue Type: New Feature
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


A group-by aggregate results in a replace table. For an infinite group-by aggregate, 
we need a mechanism to define when the data should be emitted (early fired). 
This task aims to implement the initial version of the unbounded group-by 
aggregate, where we update and emit the aggregate value for each arriving record. 
In the future, we will implement the mechanism and interface to let users define 
the frequency/period of early firing for the unbounded group-by aggregation results.

The limited space of the backend state is one of the major obstacles to supporting 
unbounded group-by aggregates in practice. For this reason, we suggest two 
common (and very useful) use cases of this unbounded group-by aggregate:
1. The range of the grouping key is limited. In this case, a newly arrived record 
will either be inserted into the state as a new record or replace the existing 
record in the backend state. The data in the backend state will not be evicted if 
the resources are properly provisioned by the user, so we can guarantee the 
correctness of the aggregation results.
2. When the grouping key is unlimited, we will not be able to ensure 100% 
correctness of the "unbounded groupby aggregate". In this case, we will rely on 
the TTL mechanism of the RocksDB backend state to evict old data, so that we 
can guarantee correct results within a certain time range.
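
To make the first use case concrete, a small Table API sketch (illustrative only; 
orders, 'user and 'amount are assumed example names) of a window-less group-by 
aggregate whose result is updated and emitted per arriving record:
{code}
// Illustrative only: a window-less (unbounded) group-by aggregate.
// Under the behaviour described above, the row for a given 'user is
// re-emitted (refined) every time a new record for that user arrives.
import org.apache.flink.table.api.scala._

val result = orders
  .groupBy('user)
  .select('user, 'amount.sum as 'totalAmount, 'amount.count as 'cnt)
{code}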




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-6091) Implement and turn on the retraction for grouping window aggregate

2017-03-24 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-6091:


Assignee: Hequn Cheng  (was: Shaoxuan Wang)

> Implement and turn on the retraction for grouping window aggregate
> --
>
> Key: FLINK-6091
> URL: https://issues.apache.org/jira/browse/FLINK-6091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Hequn Cheng
>
> Implement the functions for processing retract messages for the grouping window 
> aggregate. No retract generating function is needed for now, as the current 
> grouping window aggregates are all executed without early firing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-6090) Implement optimizer for retraction and turn on retraction for over window aggregate

2017-03-24 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-6090:


Assignee: Hequn Cheng  (was: Shaoxuan Wang)

> Implement optimizer for retraction and turn on retraction for over window 
> aggregate
> ---
>
> Key: FLINK-6090
> URL: https://issues.apache.org/jira/browse/FLINK-6090
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Hequn Cheng
>
> Implement the optimizer for retraction and turn on retraction for the over 
> window aggregate as the first prototype example:
>   1. Add a RetractionRule at the decoration stage, which can derive the 
> replace table/append table and the NeedRetraction property.
>   2. Match the NeedRetraction and replace table properties and mark the 
> accumulating mode; add the necessary retract generating function at the replace 
> table, and add the retract processing logic at the retract consumer.
>   3. Turn on retraction for the over window aggregate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-6089) Implement decoration phase for rewriting predicated logical plan after volcano optimization phase

2017-03-17 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-6089:


Assignee: Hequn Cheng  (was: Shaoxuan Wang)

> Implement decoration phase for rewriting predicated logical plan after 
> volcano optimization phase
> -
>
> Key: FLINK-6089
> URL: https://issues.apache.org/jira/browse/FLINK-6089
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Shaoxuan Wang
>Assignee: Hequn Cheng
>
> At present, there is no chance to modify the DataStreamRel tree after the 
> volcano optimization. We propose to add a decoration phase after the volcano 
> optimization phase. The decoration phase is dedicated to rewriting the predicated 
> logical plan and is independent of the cost module. Once the decoration phase is 
> added, we get the chance to apply the retraction rules at this phase.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-6047) Master Jira for "Retraction for Flink Streaming"

2017-03-16 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-6047:
-
Description: 
[Design doc]:
https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw

[Introduction]:
"Retraction" is an important building block for data streaming to refine the 
early fired results in streaming. “Early firing” are very common and widely 
used in many streaming scenarios, for instance “window-less” or unbounded 
aggregate and stream-stream inner join, windowed (with early firing) aggregate 
and stream-stream inner join. There are mainly two cases that require 
retractions: 1) update on the keyed table (the key is either a primaryKey (PK) 
on source table, or a groupKey/partitionKey in an aggregate); 2) When dynamic 
windows (e.g., session window) are in use, the new value may be replacing more 
than one previous window due to window merging. 

To the best of our knowledge, the retraction for the early fired streaming 
results has never been practically solved before. In this proposal, we develop 
a retraction solution and explain how it works for the problem of “update on 
the keyed table”. The same solution can be easily extended for the dynamic 
windows merging, as the key component of retraction - how to refine an early 
fired results - is the same across different problems.  

[Proposed Jiras]:
Implement decoration phase for rewriting predicated logical plan after volcano 
optimization phase
Implement optimizer for retraction and turn on retraction for over window 
aggregate
Implement and turn on the retraction for grouping window aggregate
Implement and turn on retraction for table source
Implement and turn on retraction for table sink
Implement and turn on retraction for stream-stream inner join
Implement the retraction for the early firing window
Implement the retraction for the dynamic window with early firing



  was:
[Design doc]:
https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw

[Introduction]:
"Retraction" is an important building block for data streaming to refine the 
early fired results in streaming. “Early firing” are very common and widely 
used in many streaming scenarios, for instance “window-less” or unbounded 
aggregate and stream-stream inner join, windowed (with early firing) aggregate 
and stream-stream inner join. There are mainly two cases that require 
retractions: 1) update on the keyed table (the key is either a primaryKey (PK) 
on source table, or a groupKey/partitionKey in an aggregate); 2) When dynamic 
windows (e.g., session window) are in use, the new value may be replacing more 
than one previous window due to window merging. 

To the best of our knowledge, the retraction for the early fired streaming 
results has never been practically solved before. In this proposal, we develop 
a retraction solution and explain how it works for the problem of “update on 
the keyed table”. The same solution can be easily extended for the dynamic 
windows merging, as the key component of retraction - how to refine an early 
fired results - is the same across different problems.  

[Proposed Jiras]:
Implement decoration phase for predicated logical plan rewriting after volcano 
optimization phase
Add source with table primary key and replace table property
Add sink tableInsert and NeedRetract property
Implement the retraction for partitioned unbounded over window aggregate
Implement the retraction for stream-stream inner join
Implement the retraction for the early firing window
Implement the retraction for the dynamic window with early firing




> Master Jira for "Retraction for Flink Streaming"
> 
>
> Key: FLINK-6047
> URL: https://issues.apache.org/jira/browse/FLINK-6047
> Project: Flink
>  Issue Type: New Feature
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> [Design doc]:
> https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw
> [Introduction]:
> "Retraction" is an important building block for data streaming to refine the 
> early fired results in streaming. “Early firing” is very common and widely 
> used in many streaming scenarios, for instance “window-less” or unbounded 
> aggregate and stream-stream inner join, windowed (with early firing) 
> aggregate and stream-stream inner join. There are mainly two cases that 
> require retractions: 1) update on the keyed table (the key is either a 
> primaryKey (PK) on source table, or a groupKey/partitionKey in an aggregate); 
> 2) When dynamic windows (e.g., session window) are in use, the new value may 
> be replacing more than one previous window due to window merging. 
> To the best of our knowledge, the retraction for the early fired streaming 
> results has never been practically solved before. In this 

[jira] [Created] (FLINK-6094) Implement and turn on retraction for stream-stream inner join

2017-03-16 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6094:


 Summary: Implement and turn on retraction for stream-stream inner 
join
 Key: FLINK-6094
 URL: https://issues.apache.org/jira/browse/FLINK-6094
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


This includes:
Modify the RetractionRule to consider stream-stream inner join
Implement the retract process logic for join



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6093) Implement and turn on retraction for table sink

2017-03-16 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6093:


 Summary: Implement and turn on retraction for table sink 
 Key: FLINK-6093
 URL: https://issues.apache.org/jira/browse/FLINK-6093
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Add sink tableInsert and NeedRetract property, and consider table sink in 
optimizer RetractionRule



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6092) Implement and turn on retraction for table source

2017-03-16 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6092:


 Summary: Implement and turn on retraction for table source 
 Key: FLINK-6092
 URL: https://issues.apache.org/jira/browse/FLINK-6092
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Add the Primary Key and replace/append properties for table source, and 
consider table source in optimizer RetractionRule 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6091) Implement and turn on the retraction for grouping window aggregate

2017-03-16 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6091:


 Summary: Implement and turn on the retraction for grouping window 
aggregate
 Key: FLINK-6091
 URL: https://issues.apache.org/jira/browse/FLINK-6091
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Implement the functions for processing retract messages for the grouping window 
aggregate. No retract generating function is needed for now, as the current 
grouping window aggregates are all executed without early firing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6090) Implement optimizer for retraction and turn on retraction for over window aggregate

2017-03-16 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6090:


 Summary: Implement optimizer for retraction and turn on retraction 
for over window aggregate
 Key: FLINK-6090
 URL: https://issues.apache.org/jira/browse/FLINK-6090
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Implement the optimizer for retraction and turn on retraction for the over window 
aggregate as the first prototype example:
  1. Add a RetractionRule at the decoration stage, which can derive the replace 
table/append table and the NeedRetraction property.
  2. Match the NeedRetraction and replace table properties and mark the 
accumulating mode; add the necessary retract generating function at the replace 
table, and add the retract processing logic at the retract consumer.
  3. Turn on retraction for the over window aggregate.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6089) Implement decoration phase for rewriting predicated logical plan after volcano optimization phase

2017-03-16 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6089:


 Summary: Implement decoration phase for rewriting predicated 
logical plan after volcano optimization phase
 Key: FLINK-6089
 URL: https://issues.apache.org/jira/browse/FLINK-6089
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


At present, there is no chance to modify the DataStreamRel tree after the 
volcano optimization. We propose to add a decoration phase after the volcano 
optimization phase. The decoration phase is dedicated to rewriting the predicated 
logical plan and is independent of the cost module. Once the decoration phase is 
added, we get the chance to apply the retraction rules at this phase.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-6047) Master Jira for "Retraction for Flink Streaming"

2017-03-14 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-6047:


 Summary: Master Jira for "Retraction for Flink Streaming"
 Key: FLINK-6047
 URL: https://issues.apache.org/jira/browse/FLINK-6047
 Project: Flink
  Issue Type: New Feature
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


[Design doc]:
https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw

[Introduction]:
"Retraction" is an important building block for data streaming to refine the 
early fired results in streaming. “Early firing” are very common and widely 
used in many streaming scenarios, for instance “window-less” or unbounded 
aggregate and stream-stream inner join, windowed (with early firing) aggregate 
and stream-stream inner join. There are mainly two cases that require 
retractions: 1) update on the keyed table (the key is either a primaryKey (PK) 
on source table, or a groupKey/partitionKey in an aggregate); 2) When dynamic 
windows (e.g., session window) are in use, the new value may be replacing more 
than one previous window due to window merging. 

To the best of our knowledge, the retraction for the early fired streaming 
results has never been practically solved before. In this proposal, we develop 
a retraction solution and explain how it works for the problem of “update on 
the keyed table”. The same solution can be easily extended for the dynamic 
windows merging, as the key component of retraction - how to refine an early 
fired results - is the same across different problems.  

[Proposed Jiras]:
Implement decoration phase for predicated logical plan rewriting after volcano 
optimization phase
Add source with table primary key and replace table property
Add sink tableInsert and NeedRetract property
Implement the retraction for partitioned unbounded over window aggregate
Implement the retraction for stream-stream inner join
Implement the retraction for the early firing window
Implement the retraction for the dynamic window with early firing
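
To illustrate the refinement problem described above, a small self-contained toy 
trace (plain Scala, not Flink code) of the changelog a per-key count would produce 
when each update first retracts the previously emitted row:
{code}
// Toy illustration of early firing plus retraction for a keyed count.
object RetractionTrace {
  def main(args: Array[String]): Unit = {
    val input = Seq("a", "a", "b", "a")
    var counts = Map.empty[String, Long]

    input.foreach { key =>
      // Refine the early fired result: retract the old row, emit the new one.
      counts.get(key).foreach(old => println(s"retract ($key, $old)"))
      val updated = counts.getOrElse(key, 0L) + 1
      counts += key -> updated
      println(s"emit    ($key, $updated)")
    }
    // A downstream consumer must apply the retract messages, otherwise the
    // early fired values would be double counted.
  }
}
{code}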





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-5984) Add resetAccumulator method for AggregateFunction

2017-03-08 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5984:
-
Summary: Add resetAccumulator method for AggregateFunction  (was: Allow 
reusing of accumulators in AggregateFunction)

> Add resetAccumulator method for AggregateFunction
> -
>
> Key: FLINK-5984
> URL: https://issues.apache.org/jira/browse/FLINK-5984
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>Assignee: Shaoxuan Wang
>
> Right now we have to create a new accumulator object if we just want to reset 
> it. We should allow passing the old one as a {{reuse}} object to 
> {{AggregateFunction#createAccumulator}}. The aggregate function then can 
> decide if it wants to create a new object or reset the old one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-5984) Allow reusing of accumulators in AggregateFunction

2017-03-08 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-5984:


Assignee: Shaoxuan Wang

> Allow reusing of accumulators in AggregateFunction
> --
>
> Key: FLINK-5984
> URL: https://issues.apache.org/jira/browse/FLINK-5984
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>Assignee: Shaoxuan Wang
>
> Right now we have to create a new accumulator object if we just want to reset 
> it. We should allow passing the old one as a {{reuse}} object to 
> {{AggregateFunction#createAccumulator}}. The aggregate function then can 
> decide if it wants to create a new object or reset the old one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5984) Allow reusing of accumulators in AggregateFunction

2017-03-07 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900597#comment-15900597
 ] 

Shaoxuan Wang commented on FLINK-5984:
--

[~twalthr] Thanks for the suggestion. This seems like a new "resetAccumulator" 
method to me. It is valuable for DataSet, while DataStream does not need this 
for now.
{code}
  override def resetAccumulator(acc: Accumulator) = {
    val a = acc.asInstanceOf[SumAccumulator[T]]
    a.f0 = numeric.zero // sum
    a.f1 = false
  }
{code}
Fabian's proposal also looks good to me, as long as we make the "reuse 
parameter" entirely clear to the users with detailed annotations.
[~twalthr], if you have not started on this Jira, I can help to make the 
changes.

> Allow reusing of accumulators in AggregateFunction
> --
>
> Key: FLINK-5984
> URL: https://issues.apache.org/jira/browse/FLINK-5984
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>
> Right now we have to create a new accumulator object if we just want to reset 
> it. We should allow passing the old one as a {{reuse}} object to 
> {{AggregateFunction#createAccumulator}}. The aggregate function then can 
> decide if it wants to create a new object or reset the old one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Issue Comment Deleted] (FLINK-5983) Replace for/foreach/map in aggregates by while loops

2017-03-07 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5983:
-
Comment: was deleted

(was: OK, the plan sounds good to me. The runtime processing function for 
the aggregate will be codeGened, and it is the user's call whether to use Java 
or Scala user-defined aggregate functions; if it is written in Scala, the user 
is responsible for the performance.)

> Replace for/foreach/map in aggregates by while loops
> 
>
> Key: FLINK-5983
> URL: https://issues.apache.org/jira/browse/FLINK-5983
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>
> Right now there is a mixture of different kinds of loops within aggregate 
> functions. Although performance is not the main goal at the moment, we should 
> focus on performant execution, especially in these runtime functions.
> e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}
> We should replace loops, maps etc. by primitive while loops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5983) Replace for/foreach/map in aggregates by while loops

2017-03-07 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899620#comment-15899620
 ] 

Shaoxuan Wang commented on FLINK-5983:
--

OK, the plan sounds good to me. The runtime processing function for the aggregate 
will be codeGened, and it is the user's call whether to use Java or Scala 
user-defined aggregate functions; if it is written in Scala, the user is 
responsible for the performance.

> Replace for/foreach/map in aggregates by while loops
> 
>
> Key: FLINK-5983
> URL: https://issues.apache.org/jira/browse/FLINK-5983
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>
> Right now there is a mixture of different kinds of loops within aggregate 
> functions. Although performance is not the main goal at the moment, we should 
> focus on performant execution, especially in these runtime functions.
> e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}
> We should replace loops, maps etc. by primitive while loops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5983) Replace for/foreach/map in aggregates by while loops

2017-03-07 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899619#comment-15899619
 ] 

Shaoxuan Wang commented on FLINK-5983:
--

OK, the plan sounds good to me. The runtime processing function for the aggregate 
will be codeGened, and it is the user's call whether to use Java or Scala 
user-defined aggregate functions; if it is written in Scala, the user is 
responsible for the performance.

> Replace for/foreach/map in aggregates by while loops
> 
>
> Key: FLINK-5983
> URL: https://issues.apache.org/jira/browse/FLINK-5983
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>
> Right now there is a mixture of different kinds of loops within aggregate 
> functions. Although performance is not the main goal at the moment, we should 
> focus on performant execution, especially in these runtime functions.
> e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}
> We should replace loops, maps etc. by primitive while loops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5983) Replace for/foreach/map in aggregates by while loops

2017-03-07 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899594#comment-15899594
 ] 

Shaoxuan Wang commented on FLINK-5983:
--

Can we consider implementing the runtime code and even the built-in aggregate 
functions in Java?

> Replace for/foreach/map in aggregates by while loops
> 
>
> Key: FLINK-5983
> URL: https://issues.apache.org/jira/browse/FLINK-5983
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>
> Right now there is a mixture of different kinds of loops within aggregate 
> functions. Although performance is not the main goal at the moment, we should 
> focus on performant execution, especially in these runtime functions.
> e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}
> We should replace loops, maps etc. by primitive while loops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5983) Replace for/foreach/map in aggregates by while loops

2017-03-07 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899577#comment-15899577
 ] 

Shaoxuan Wang commented on FLINK-5983:
--

I love this proposal to replace Scala for loops with while loops in the runtime 
functions. Thanks [~twalthr].

> Replace for/foreach/map in aggregates by while loops
> 
>
> Key: FLINK-5983
> URL: https://issues.apache.org/jira/browse/FLINK-5983
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Timo Walther
>
> Right now there is a mixture of different kinds of loops within aggregate 
> functions. Although performance is not the main goal at the moment, we should 
> focus on performant execution, especially in these runtime functions.
> e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}
> We should replace loops, maps etc. by primitive while loops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-5655) Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL

2017-03-06 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-5655:


Assignee: sunjincheng  (was: Shaoxuan Wang)

> Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL
> 
>
> Key: FLINK-5655
> URL: https://issues.apache.org/jira/browse/FLINK-5655
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Fabian Hueske
>Assignee: sunjincheng
>
> The goal of this issue is to add support for OVER RANGE aggregations on event 
> time streams to the SQL interface.
> Queries similar to the following should be supported:
> {code}
> SELECT 
>   a, 
>   SUM(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' 
> HOUR PRECEDING AND CURRENT ROW) AS sumB,
>   MIN(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' 
> HOUR PRECEDING AND CURRENT ROW) AS minB
> FROM myStream
> {code}
> The following restrictions should initially apply:
> - All OVER clauses in the same SELECT clause must be exactly the same.
> - The PARTITION BY clause is optional (no partitioning results in single 
> threaded execution).
> - The ORDER BY clause may only have rowTime() as parameter. rowTime() is a 
> parameterless scalar function that just indicates event time mode.
> - UNBOUNDED PRECEDING is not supported (see FLINK-5658)
> - FOLLOWING is not supported.
> The restrictions will be resolved in follow up issues. If we find that some 
> of the restrictions are trivial to address, we can add the functionality in 
> this issue as well.
> This issue includes:
> - Design of the DataStream operator to compute OVER ROW aggregates
> - Translation from Calcite's RelNode representation (LogicalProject with 
> RexOver expression).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5655) Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL

2017-03-06 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898814#comment-15898814
 ] 

Shaoxuan Wang commented on FLINK-5655:
--

[~sunjincheng121] and I want to start the implementation on this event-time 
bounded over window. We will start with a design doc. Reassigning the task to 
sunjincheng121.

> Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL
> 
>
> Key: FLINK-5655
> URL: https://issues.apache.org/jira/browse/FLINK-5655
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Fabian Hueske
>Assignee: sunjincheng
>
> The goal of this issue is to add support for OVER RANGE aggregations on event 
> time streams to the SQL interface.
> Queries similar to the following should be supported:
> {code}
> SELECT 
>   a, 
>   SUM(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' 
> HOUR PRECEDING AND CURRENT ROW) AS sumB,
>   MIN(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' 
> HOUR PRECEDING AND CURRENT ROW) AS minB
> FROM myStream
> {code}
> The following restrictions should initially apply:
> - All OVER clauses in the same SELECT clause must be exactly the same.
> - The PARTITION BY clause is optional (no partitioning results in single 
> threaded execution).
> - The ORDER BY clause may only have rowTime() as parameter. rowTime() is a 
> parameterless scalar function that just indicates event time mode.
> - UNBOUNDED PRECEDING is not supported (see FLINK-5658)
> - FOLLOWING is not supported.
> The restrictions will be resolved in follow up issues. If we find that some 
> of the restrictions are trivial to address, we can add the functionality in 
> this issue as well.
> This issue includes:
> - Design of the DataStream operator to compute OVER ROW aggregates
> - Translation from Calcite's RelNode representation (LogicalProject with 
> RexOver expression).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5956) Add retract method into the aggregateFunction

2017-03-03 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5956:


 Summary: Add retract method into the aggregateFunction
 Key: FLINK-5956
 URL: https://issues.apache.org/jira/browse/FLINK-5956
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


The retraction method helps with processing updated messages. It will also be 
very helpful for window aggregation. This PR will first add retraction methods 
into the aggregate functions, so that the on-going over window aggregation work 
can benefit from it.
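
As a rough illustration of the intended contract (a sketch only, assuming the 
AggregateFunction-style interface from the UDAGG work; the concrete signatures 
are part of that design, and the class names here are examples):
{code}
// Illustrative sketch: a count aggregate with a contracted retract method,
// so that an updated or replaced record can be subtracted from the result.
import org.apache.flink.table.functions.AggregateFunction

class CountAccumulator {
  var count: Long = 0L
}

class CountWithRetract extends AggregateFunction[Long, CountAccumulator] {

  override def createAccumulator(): CountAccumulator = new CountAccumulator

  // Contracted method: applies one incoming record to the accumulator.
  def accumulate(acc: CountAccumulator): Unit = {
    acc.count += 1
  }

  // Contracted method: undoes a previous accumulate call for a retracted record.
  def retract(acc: CountAccumulator): Unit = {
    acc.count -= 1
  }

  override def getValue(acc: CountAccumulator): Long = acc.count
}
{code}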



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5955) Merging a list of buffered records will have problem when ObjectReuse is turned on

2017-03-02 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5955:


 Summary: Merging a list of buffered records will have problem when 
ObjectReuse is turned on
 Key: FLINK-5955
 URL: https://issues.apache.org/jira/browse/FLINK-5955
 Project: Flink
  Issue Type: Bug
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Turn on object reuse in MultipleProgramsTestBase:
{code}
TestEnvironment clusterEnv = new TestEnvironment(cluster, 4, true);
{code}

Then the tests "testEventTimeSessionGroupWindow", 
"testEventTimeSessionGroupWindow", and 
"testEventTimeTumblingGroupWindowOverTime" will fail.

The reason is that we buffer the iterated records for group-merge. I think 
we should change the Agg merge to pair-merge, and later add group-merge when 
needed (in the future we should add rules to select either pair-merge or 
group-merge, but for now all built-in aggregates should work fine with 
pair-merge).
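
For illustration, a small self-contained example (plain Scala, not the Flink 
runtime code) of why buffering records breaks under object reuse:
{code}
// Toy reproduction of the object-reuse pitfall: the runtime reuses one
// mutable record instance, so buffering references to it for a later
// group-merge makes every buffered entry observe only the last value.
import scala.collection.mutable.ArrayBuffer

class ReusableRecord(var value: Int)

object ObjectReusePitfall {
  def main(args: Array[String]): Unit = {
    val reused = new ReusableRecord(0)
    val buffered = ArrayBuffer.empty[ReusableRecord]

    for (v <- 1 to 3) {
      reused.value = v   // the "runtime" overwrites the same instance
      buffered += reused // we buffer the reference, not a copy
    }

    // Prints "3 3 3" instead of "1 2 3": the buffered group is corrupted,
    // which is why pair-merge (consume each record immediately) is safer.
    println(buffered.map(_.value).mkString(" "))
  }
}
{code}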



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5927) Remove old Aggregate interface and built-in functions

2017-02-27 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5927:


 Summary: Remove old Aggregate interface and built-in functions
 Key: FLINK-5927
 URL: https://issues.apache.org/jira/browse/FLINK-5927
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-5768) Apply new aggregation functions for datastream and dataset tables

2017-02-26 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5768:
-
Description: 
Apply new aggregation functions for datastream and dataset tables

This includes:
1. Change the implementation of the DataStream aggregation runtime code to use 
the new aggregation functions and the aggregate DataStream API.
2. DataStream will always run in incremental mode, as explained on 
06/Feb/2017 in FLINK-5564.
3. Change the implementation of the DataSet aggregation runtime code to use the 
new aggregation functions.
4. Clean up unused classes and methods.


  was:Change the implementation of the DataStream aggregation runtime code to 
use new aggregation functions and aggregate dataStream API.


> Apply new aggregation functions for datastream and dataset tables
> -
>
> Key: FLINK-5768
> URL: https://issues.apache.org/jira/browse/FLINK-5768
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> Apply new aggregation functions for datastream and dataset tables
> This includes:
> 1. Change the implementation of the DataStream aggregation runtime code to 
> use the new aggregation functions and the aggregate DataStream API.
> 2. DataStream will always run in incremental mode, as explained on 
> 06/Feb/2017 in FLINK-5564.
> 3. Change the implementation of the DataSet aggregation runtime code to use 
> the new aggregation functions.
> 4. Clean up unused classes and methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-5768) Apply new aggregation functions for datastream and dataset tables

2017-02-26 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5768:
-
Summary: Apply new aggregation functions for datastream and dataset tables  
(was: Apply new aggregation functions and aggregate DataStream API for 
streaming tables)

> Apply new aggregation functions for datastream and dataset tables
> -
>
> Key: FLINK-5768
> URL: https://issues.apache.org/jira/browse/FLINK-5768
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> Change the implementation of the DataStream aggregation runtime code to use 
> new aggregation functions and aggregate dataStream API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5564) User Defined Aggregates

2017-02-24 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884070#comment-15884070
 ] 

Shaoxuan Wang commented on FLINK-5564:
--

It is hard to separate the PRs for DataStream (FLINK-5768) and DataSet 
(FLINK-5769). The method we use to translate an aggregate to the AggregateFunction 
(transformToAggregateFunctions) is the same regardless of DataSet or DataStream. 
I would like to merge these two tasks and just provide one PR.

> User Defined Aggregates
> ---
>
> Key: FLINK-5564
> URL: https://issues.apache.org/jira/browse/FLINK-5564
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> User-defined aggregates would be a great addition to the Table API / SQL.
> The current aggregate interface is not well suited for the external users.  
> This issue proposes to redesign the aggregate such that we can expose a 
> better external UDAGG interface to the users. The detailed design proposal 
> can be found here: 
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit
> Motivation:
> 1. The current aggregate interface is not very concise to the users. One 
> needs to know the design details of the intermediate Row buffer before 
> implementing an Aggregate. Seven functions are needed even for a simple Count 
> aggregate.
> 2. Another limitation of current aggregate function is that it can only be 
> applied on one single column. There are many scenarios which require the 
> aggregate function taking multiple columns as the inputs.
> 3. “Retraction” is not considered and covered in the current Aggregate.
> 4. It might be very good to have a local/global aggregate query plan 
> optimization, which is very promising to optimize UDAGG performance in some 
> scenarios.
> Proposed Changes:
> 1. Implement an aggregate dataStream API (Done by 
> [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582])
> 2. Update all the existing aggregates to use the new aggregate dataStream API
> 3. Provide a better User-Defined Aggregate interface
> 4. Add retraction support
> 5. Add local/global aggregate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5915) Add support for the aggregate on multi fields

2017-02-24 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5915:


 Summary: Add support for the aggregate on multi fields
 Key: FLINK-5915
 URL: https://issues.apache.org/jira/browse/FLINK-5915
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Some UDAGGs have multiple fields as input. For instance:
{code}
table
  .window(Tumble over 10.minutes on 'rowtime as 'w)
  .groupBy('key, 'w)
  .select('key, weightedAvg('value, 'weight))
{code}

This task will add support for aggregates on multiple fields.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5914) remove aggregateResultType from streaming.api.datastream.aggregate

2017-02-24 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5914:


 Summary: remove aggregateResultType from 
streaming.api.datastream.aggregate
 Key: FLINK-5914
 URL: https://issues.apache.org/jira/browse/FLINK-5914
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


aggregateResultType does not seem necessary for 
streaming.api.datastream.aggregate. We will not serialize the aggregate result 
between the aggregate and the window function anyway. The aggregate function 
itself provides a getResult() function, and the window function here should just 
emit the same results as the aggregate output. So aggregateResultType should be 
the same as resultType. I think we can safely remove aggregateResultType, so that 
users will not have to provide the same type twice for 
streaming.api.datastream.aggregate.

[~StephanEwen], what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (FLINK-5906) Add support to register UDAGGs in TableEnvironment

2017-02-24 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang reassigned FLINK-5906:


Assignee: Shaoxuan Wang

> Add support to register UDAGGs in TableEnvironment
> --
>
> Key: FLINK-5906
> URL: https://issues.apache.org/jira/browse/FLINK-5906
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Fabian Hueske
>Assignee: Shaoxuan Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-5899) Fix the bug in EventTimeTumblingWindow for non-partialMerge aggregate

2017-02-23 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5899:
-
Summary: Fix the bug in EventTimeTumblingWindow for non-partialMerge 
aggregate  (was: Fix the bug in initializing the 
DataSetTumbleTimeWindowAggReduceGroupFunction)

> Fix the bug in EventTimeTumblingWindow for non-partialMerge aggregate
> -
>
> Key: FLINK-5899
> URL: https://issues.apache.org/jira/browse/FLINK-5899
> Project: Flink
>  Issue Type: Bug
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> The row length used to initialize 
> DataSetTumbleTimeWindowAggReduceGroupFunction was not set properly. (I think 
> this was introduced by mistake when merging the code.)
> We currently lack built-in non-partial-merge aggregates, therefore this 
> has not been caught by the unit tests. 
> Reproduction steps:
> 1. Set "supportPartial" to false for SumAggregate.
> 2. Then both testAllEventTimeTumblingWindowOverTime and 
> testEventTimeTumblingGroupWindowOverTime will fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5900) Add non-partial merge Aggregates and unit tests

2017-02-23 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5900:


 Summary: Add non-partial merge Aggregates and unit tests
 Key: FLINK-5900
 URL: https://issues.apache.org/jira/browse/FLINK-5900
 Project: Flink
  Issue Type: Improvement
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


All current built-in aggregates support partial merge. We are flying blind and 
cannot tell whether the non-partial-merge path works or not. We should add 
non-partial-merge aggregates and unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5899) Fix the bug in initializing the DataSetTumbleTimeWindowAggReduceGroupFunction

2017-02-23 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5899:


 Summary: Fix the bug in initializing the 
DataSetTumbleTimeWindowAggReduceGroupFunction
 Key: FLINK-5899
 URL: https://issues.apache.org/jira/browse/FLINK-5899
 Project: Flink
  Issue Type: Bug
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


The row length used to initialize DataSetTumbleTimeWindowAggReduceGroupFunction 
was not set properly. (I think this was introduced by mistake when merging the 
code.)
We currently lack built-in non-partial-merge aggregates, therefore this has 
not been caught by the unit tests. 

Reproduction steps:
1. Set "supportPartial" to false for SumAggregate.
2. Then both testAllEventTimeTumblingWindowOverTime and 
testEventTimeTumblingGroupWindowOverTime will fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5878) Add stream-stream inner join on TableAPI

2017-02-21 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5878:


 Summary: Add stream-stream inner join on TableAPI
 Key: FLINK-5878
 URL: https://issues.apache.org/jira/browse/FLINK-5878
 Project: Flink
  Issue Type: New Feature
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


This task is intended to support stream-stream inner join in the Table API. A 
brief design doc has been created: 
https://docs.google.com/document/d/10oJDw-P9fImD5tc3Stwr7aGIhvem4l8uB2bjLzK72u4/edit

We propose to use MapState as the backend state interface for this "join" 
operator, so this task requires FLINK-4856.
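
For reference, a sketch of the intended Table API usage; orders, payments and the 
field names are assumed examples, and the exact API surface is defined in the 
design doc above:
{code}
// Illustrative only: the target shape of a stream-stream inner join
// expressed through the Table API.
import org.apache.flink.table.api.scala._

val joined = orders
  .join(payments)
  .where('orderId === 'payOrderId)
  .select('orderId, 'amount, 'payTime)
{code}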



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (FLINK-5767) New aggregate function interface and built-in aggregate functions

2017-02-18 Thread Shaoxuan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaoxuan Wang updated FLINK-5767:
-
Summary: New aggregate function interface and built-in aggregate functions  
(was: Add a new aggregate function interface)

> New aggregate function interface and built-in aggregate functions
> -
>
> Key: FLINK-5767
> URL: https://issues.apache.org/jira/browse/FLINK-5767
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> Add a new aggregate function interface. This includes implementing the 
> aggregate interface, migrating the existing aggregation functions to this 
> interface, and adding the unit tests for these functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5813) code generation for user-defined aggregate functions

2017-02-15 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5813:


 Summary: code generation for user-defined aggregate functions
 Key: FLINK-5813
 URL: https://issues.apache.org/jira/browse/FLINK-5813
 Project: Flink
  Issue Type: Sub-task
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


The input and return types of the new proposed UDAGG functions are dynamically 
given by the users. All these user defined functions have to be generated via 
codegen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5768) Apply new aggregation functions and aggregate DataStream API for streaming tables

2017-02-09 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5768:


 Summary: Apply new aggregation functions and aggregate DataStream 
API for streaming tables
 Key: FLINK-5768
 URL: https://issues.apache.org/jira/browse/FLINK-5768
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Change the implementation of the DataStream aggregation runtime code to use new 
aggregation functions and aggregate dataStream API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5769) Apply new aggregation functions for dataset tables

2017-02-09 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5769:


 Summary: Apply new aggregation functions for dataset tables
 Key: FLINK-5769
 URL: https://issues.apache.org/jira/browse/FLINK-5769
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Change the implementation of the DataSet aggregation runtime code to use the new 
aggregation functions.
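
A hedged sketch of the batch side, assuming the same accumulate/getValue contract is 
driven from a GroupReduceFunction on a DataSet; the inline sum "accumulator" is 
illustrative only:

{code:java}
// Sketch: per-group accumulator driven from a GroupReduceFunction on a DataSet.
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class DataSetAggSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 5));

        input.groupBy(0)
             .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                 @Override
                 public void reduce(Iterable<Tuple2<String, Integer>> rows,
                                    Collector<Tuple2<String, Integer>> out) {
                     String key = null;
                     int acc = 0;                      // createAccumulator()
                     for (Tuple2<String, Integer> row : rows) {
                         key = row.f0;
                         acc += row.f1;                // accumulate(acc, row)
                     }
                     out.collect(Tuple2.of(key, acc)); // getValue(acc)
                 }
             })
             .print();
    }
}
{code}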



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5767) Add a new aggregate function interface

2017-02-09 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5767:


 Summary: Add a new aggregate function interface
 Key: FLINK-5767
 URL: https://issues.apache.org/jira/browse/FLINK-5767
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


Add a new aggregate function interface. This includes implementing the 
aggregate interface, migrating the existing aggregation functions to this 
interface, and adding the unit tests for these functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (FLINK-5564) User Defined Aggregates

2017-02-08 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859039#comment-15859039
 ] 

Shaoxuan Wang edited comment on FLINK-5564 at 2/9/17 5:16 AM:
--

Thanks [~fhueske], 
Absolutely, I agree with you that it is better to split the huge PR into 
several smaller ones (merging 1, 2, and 3 would lead to a change of more than 
3K lines).
But I am afraid I did not completely get your suggested #1. Migrating the 
existing aggregates without changing the runtime code will cause all 
integration tests to fail. One possible way is that I create the new interface 
(say AggregateFunction) and a few aggregates implemented against it (say 
intAgg extends AggregateFunction), and in step #1 I only add query-plan tests, 
like what we usually do in GroupWindowTest. Is this what you are suggesting?
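
For what it is worth, the kind of class referred to above ("intAgg extends 
AggregateFunction") could look roughly like the sketch below. Both the base class 
and the implementation are illustrative only, under the assumption of an 
accumulator-based interface; they are not the API that was eventually merged.

{code:java}
// Illustrative only: a hypothetical accumulator-based aggregate interface and
// a minimal integer-sum aggregate implemented against it.
abstract class AggregateFunction<T, ACC> {
    public abstract ACC createAccumulator();
    public abstract void accumulate(ACC accumulator, Object... input);
    public abstract T getValue(ACC accumulator);
}

public class IntSumAggregate extends AggregateFunction<Integer, IntSumAggregate.SumAcc> {

    // the intermediate state of the aggregate, kept separate from the result type
    public static class SumAcc {
        public int sum = 0;
    }

    @Override
    public SumAcc createAccumulator() {
        return new SumAcc();
    }

    @Override
    public void accumulate(SumAcc acc, Object... input) {
        if (input[0] != null) {            // ignore null input values
            acc.sum += (Integer) input[0];
        }
    }

    @Override
    public Integer getValue(SumAcc acc) {
        return acc.sum;
    }
}
{code}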


was (Author: shaoxuanwang):
Thanks [~fhueske], 
Absolutely, I agree with you that it is better to split the huge PR into 
several smaller ones (merging 1, 2, and 3 would lead to a change of more than 
3K lines).
But I am afraid I did not completely get your suggested #1. Migrating the 
existing aggregates without changing the runtime code will cause all 
integration tests to fail. One possible way is that I create the new interface 
(say AggregateFunction) and a few aggregates implemented against it (say 
intAgg extends AggregateFunction), and in step #1 I only add query-plan tests, 
like what we usually do in GroupWindowTest. Is this what you are suggesting?

> User Defined Aggregates
> ---
>
> Key: FLINK-5564
> URL: https://issues.apache.org/jira/browse/FLINK-5564
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> User-defined aggregates would be a great addition to the Table API / SQL.
> The current aggregate interface is not well suited for external users. 
> This issue proposes to redesign the aggregate such that we can expose a 
> better external UDAGG interface to the users. The detailed design proposal 
> can be found here: 
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit
> Motivation:
> 1. The current aggregate interface is not very concise for users. One 
> needs to know the design details of the intermediate Row buffer before 
> implementing an Aggregate. Seven functions are needed even for a simple 
> Count aggregate.
> 2. Another limitation of the current aggregate function is that it can 
> only be applied to a single column. There are many scenarios which require 
> the aggregate function to take multiple columns as input.
> 3. “Retraction” is not considered or covered by the current Aggregate.
> 4. It would be very good to have a local/global aggregate query plan 
> optimization, which is very promising for optimizing UDAGG performance in 
> some scenarios.
> Proposed Changes:
> 1. Implement an aggregate DataStream API (Done by 
> [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582])
> 2. Update all the existing aggregates to use the new aggregate DataStream API
> 3. Provide a better User-Defined Aggregate interface
> 4. Add retraction support
> 5. Add local/global aggregate
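
As a hedged illustration of motivation points 2 and 3 in the description above, an 
aggregate over two input columns (value and weight) with retraction support might 
look like the following. The base class mirrors the hypothetical sketch earlier in 
this message and is not the final interface.

{code:java}
// Illustrative only: a weighted-average aggregate that consumes two input
// columns and supports retraction via an explicit retract() method.
public class WeightedAvgAggregate
        extends AggregateFunction<Double, WeightedAvgAggregate.WAvgAcc> {

    public static class WAvgAcc {
        public double weightedSum = 0.0;
        public long weightSum = 0L;
    }

    @Override
    public WAvgAcc createAccumulator() {
        return new WAvgAcc();
    }

    // accumulate takes two columns: (value, weight)
    @Override
    public void accumulate(WAvgAcc acc, Object... input) {
        double value = (Double) input[0];
        long weight = (Long) input[1];
        acc.weightedSum += value * weight;
        acc.weightSum += weight;
    }

    // retraction: undo the effect of a previously accumulated row
    public void retract(WAvgAcc acc, Object... input) {
        double value = (Double) input[0];
        long weight = (Long) input[1];
        acc.weightedSum -= value * weight;
        acc.weightSum -= weight;
    }

    @Override
    public Double getValue(WAvgAcc acc) {
        return acc.weightSum == 0 ? null : acc.weightedSum / acc.weightSum;
    }
}
{code}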



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (FLINK-5564) User Defined Aggregates

2017-02-08 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859039#comment-15859039
 ] 

Shaoxuan Wang edited comment on FLINK-5564 at 2/9/17 5:16 AM:
--

Thanks [~fhueske], 
Absolutely, I agree with you that it is better to split the huge PR into 
several smaller ones (merging 1, 2, and 3 would lead to a change of more than 
3K lines).
But I am afraid I did not completely get your suggested #1. Migrating the 
existing aggregates without changing the runtime code will cause all 
integration tests to fail. One possible way is that I create the new interface 
(say AggregateFunction) and a few aggregates implemented against it (say 
intAgg extends AggregateFunction), and in step #1 I only add query-plan tests, 
like what we usually do in GroupWindowTest. Is this what you are suggesting?


was (Author: shaoxuanwang):
Thanks [~fhueske], 
Absolutely, I agree with you that it is better to split the huge PR into 
several smaller ones (merging 1, 2, and 3 would lead to a change of more than 
3K lines).
But I am afraid I did not completely get your suggested #1. Migrating the 
existing aggregates without changing the runtime code will cause all 
integration tests to fail. One possible way is that I create the new interface 
(say AggregateFunction) and a few aggregates implemented against it (say 
intAgg extends AggregateFunction), and in step #1 I only add query-plan tests, 
like what we usually do in GroupWindowTest. Is this what you are suggesting?

> User Defined Aggregates
> ---
>
> Key: FLINK-5564
> URL: https://issues.apache.org/jira/browse/FLINK-5564
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> User-defined aggregates would be a great addition to the Table API / SQL.
> The current aggregate interface is not well suited for external users. 
> This issue proposes to redesign the aggregate such that we can expose a 
> better external UDAGG interface to the users. The detailed design proposal 
> can be found here: 
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit
> Motivation:
> 1. The current aggregate interface is not very concise for users. One 
> needs to know the design details of the intermediate Row buffer before 
> implementing an Aggregate. Seven functions are needed even for a simple 
> Count aggregate.
> 2. Another limitation of the current aggregate function is that it can 
> only be applied to a single column. There are many scenarios which require 
> the aggregate function to take multiple columns as input.
> 3. “Retraction” is not considered or covered by the current Aggregate.
> 4. It would be very good to have a local/global aggregate query plan 
> optimization, which is very promising for optimizing UDAGG performance in 
> some scenarios.
> Proposed Changes:
> 1. Implement an aggregate DataStream API (Done by 
> [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582])
> 2. Update all the existing aggregates to use the new aggregate DataStream API
> 3. Provide a better User-Defined Aggregate interface
> 4. Add retraction support
> 5. Add local/global aggregate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5564) User Defined Aggregates

2017-02-08 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859039#comment-15859039
 ] 

Shaoxuan Wang commented on FLINK-5564:
--

Thanks [~fhueske], 
Absolutely, I agree with you that it is better to split the huge PR into 
several smaller ones (merging 1, 2, and 3 would lead to a change of more than 
3K lines).
But I am afraid I did not completely get your suggested #1. Migrating the 
existing aggregates without changing the runtime code will cause all 
integration tests to fail. One possible way is that I create the new interface 
(say AggregateFunction) and a few aggregates implemented against it (say 
intAgg extends AggregateFunction), and in step #1 I only add query-plan tests, 
like what we usually do in GroupWindowTest. Is this what you are suggesting?

> User Defined Aggregates
> ---
>
> Key: FLINK-5564
> URL: https://issues.apache.org/jira/browse/FLINK-5564
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> User-defined aggregates would be a great addition to the Table API / SQL.
> The current aggregate interface is not well suited for external users. 
> This issue proposes to redesign the aggregate such that we can expose a 
> better external UDAGG interface to the users. The detailed design proposal 
> can be found here: 
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit
> Motivation:
> 1. The current aggregate interface is not very concise for users. One 
> needs to know the design details of the intermediate Row buffer before 
> implementing an Aggregate. Seven functions are needed even for a simple 
> Count aggregate.
> 2. Another limitation of the current aggregate function is that it can 
> only be applied to a single column. There are many scenarios which require 
> the aggregate function to take multiple columns as input.
> 3. “Retraction” is not considered or covered by the current Aggregate.
> 4. It would be very good to have a local/global aggregate query plan 
> optimization, which is very promising for optimizing UDAGG performance in 
> some scenarios.
> Proposed Changes:
> 1. Implement an aggregate DataStream API (Done by 
> [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582])
> 2. Update all the existing aggregates to use the new aggregate DataStream API
> 3. Provide a better User-Defined Aggregate interface
> 4. Add retraction support
> 5. Add local/global aggregate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (FLINK-5564) User Defined Aggregates

2017-02-06 Thread Shaoxuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854168#comment-15854168
 ] 

Shaoxuan Wang edited comment on FLINK-5564 at 2/6/17 3:02 PM:
--

Hi all, as also discussed in the dev mailing-list thread for the FLIP-11 over 
window design: while working on refactoring the streaming query plan, I feel 
that we do not need to keep a non-incremental query plan for streaming 
aggregation, since every streaming aggregation is suitable for incremental 
aggregation (even max, min, and median). One can choose to accumulate all 
records at once when the window is completed, but the accumulate method is 
still executed to update the accumulator state for each record. The fact that 
the accumulate function is applied record by record already implies that the 
aggregation is incremental. Whether a record is accumulated on arrival 
(incremental) or all records are accumulated when the window is completed 
(non-incremental) makes no difference in terms of correctness or complexity. 
On the other hand, the non-incremental approach introduces CPU jitter and 
latency overhead, so I propose to always apply the incremental mode for all 
streaming aggregations. 
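
A tiny self-contained illustration of this point (the names are made up, not Flink 
classes): whether accumulate() is invoked per arriving record or once over all 
buffered records when the window fires, the same calls are made and the accumulator 
ends up in the same state.

{code:java}
// Sketch: incremental vs. buffered ("non-incremental") accumulation make the
// same accumulate() calls and reach the same accumulator state.
import java.util.Arrays;
import java.util.List;

public class IncrementalVsBuffered {

    static class SumAcc { long sum = 0; }

    static void accumulate(SumAcc acc, long value) {
        acc.sum += value;
    }

    public static void main(String[] args) {
        List<Long> window = Arrays.asList(3L, 5L, 7L);

        // incremental: accumulate on every record arrival
        SumAcc incremental = new SumAcc();
        for (long v : window) {
            accumulate(incremental, v);
        }

        // "non-incremental": buffer everything, then accumulate at window end
        SumAcc buffered = new SumAcc();
        List<Long> bufferedRecords = window;   // records kept until the window fires
        for (long v : bufferedRecords) {
            accumulate(buffered, v);
        }

        System.out.println(incremental.sum == buffered.sum);   // true
    }
}
{code}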



was (Author: shaoxuanwang):
Hi all, as also discussed in the dev mailing-list thread for the FLIP-11 over 
window design: when working on refactoring the streaming query plan, we do not 
need to keep a non-incremental query plan for streaming aggregation, since 
every streaming aggregation is suitable for incremental aggregation (even max, 
min, and median). One can choose to accumulate all records at once when the 
window is completed, but the accumulate method is still executed to update the 
accumulator state for each record. The fact that the accumulate function is 
applied record by record already implies that the aggregation is incremental. 
Whether a record is accumulated on arrival (incremental) or all records are 
accumulated when the window is completed (non-incremental) makes no difference 
in terms of correctness or complexity. On the other hand, the non-incremental 
approach introduces CPU jitter and latency overhead, so I propose to always 
apply the incremental mode for all streaming aggregations. 


> User Defined Aggregates
> ---
>
> Key: FLINK-5564
> URL: https://issues.apache.org/jira/browse/FLINK-5564
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API & SQL
>Reporter: Shaoxuan Wang
>Assignee: Shaoxuan Wang
>
> User-defined aggregates would be a great addition to the Table API / SQL.
> The current aggregate interface is not well suited for external users. 
> This issue proposes to redesign the aggregate such that we can expose a 
> better external UDAGG interface to the users. The detailed design proposal 
> can be found here: 
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit
> Motivation:
> 1. The current aggregate interface is not very concise for users. One 
> needs to know the design details of the intermediate Row buffer before 
> implementing an Aggregate. Seven functions are needed even for a simple 
> Count aggregate.
> 2. Another limitation of the current aggregate function is that it can 
> only be applied to a single column. There are many scenarios which require 
> the aggregate function to take multiple columns as input.
> 3. “Retraction” is not considered or covered by the current Aggregate.
> 4. It would be very good to have a local/global aggregate query plan 
> optimization, which is very promising for optimizing UDAGG performance in 
> some scenarios.
> Proposed Changes:
> 1. Implement an aggregate DataStream API (Done by 
> [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582])
> 2. Update all the existing aggregates to use the new aggregate DataStream API
> 3. Provide a better User-Defined Aggregate interface
> 4. Add retraction support
> 5. Add local/global aggregate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

