[jira] [Commented] (FLINK-5888) ForwardedFields annotation is not generating optimised execution plan in example KMeans job

2021-04-29 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336798#comment-17336798
 ] 

Flink Jira Bot commented on FLINK-5888:
---

This issue was labeled "stale-major" 7 ago and has not received any updates so 
it is being deprioritized. If this ticket is actually Major, please raise the 
priority and ask a committer to assign you the issue or revive the public 
discussion.


> ForwardedFields annotation is not generating optimised execution plan in 
> example KMeans job
> ---
>
> Key: FLINK-5888
> URL: https://issues.apache.org/jira/browse/FLINK-5888
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataSet, Examples
>Affects Versions: 1.1.3
>Reporter: Ziyad Muhammed Mohiyudheen
>Priority: Major
>  Labels: stale-major
>
> Flink KMeans java example [1] shows the usage of ForwardedFields function 
> annotation. How ever, the example job was taking more time than expected on 
> medium sized data itself. By merely removing the function annotation from the 
> example code (with out any other change), a better execution plan and run 
> time was obtained. The execution plan shows that no combiner is used and the 
> two Map tasks are not chained when ForwardedFields is enabled. The experiment 
> is documented in [2]
> [1] 
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
> [2] https://drive.google.com/open?id=0B0IlZv0uHBuvVEZ5ZmNpN19jVVU



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-5888) ForwardedFields annotation is not generating optimised execution plan in example KMeans job

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328825#comment-17328825
 ] 

Flink Jira Bot commented on FLINK-5888:
---

This major issue is unassigned and itself and all of its Sub-Tasks have not 
been updated for 30 days. So, it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update. Afterwards, 
please remove the label. In 7 days the issue will be deprioritized.

> ForwardedFields annotation is not generating optimised execution plan in 
> example KMeans job
> ---
>
> Key: FLINK-5888
> URL: https://issues.apache.org/jira/browse/FLINK-5888
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataSet, Examples
>Affects Versions: 1.1.3
>Reporter: Ziyad Muhammed Mohiyudheen
>Priority: Major
>  Labels: stale-major
>
> Flink KMeans java example [1] shows the usage of ForwardedFields function 
> annotation. How ever, the example job was taking more time than expected on 
> medium sized data itself. By merely removing the function annotation from the 
> example code (with out any other change), a better execution plan and run 
> time was obtained. The execution plan shows that no combiner is used and the 
> two Map tasks are not chained when ForwardedFields is enabled. The experiment 
> is documented in [2]
> [1] 
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
> [2] https://drive.google.com/open?id=0B0IlZv0uHBuvVEZ5ZmNpN19jVVU



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-5888) ForwardedFields annotation is not generating optimised execution plan in example KMeans job

2017-02-23 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880265#comment-15880265
 ] 

Fabian Hueske commented on FLINK-5888:
--

Yes, that was expected since the partitioning was moved between the maps as 
[~ggevay] pointed out.
I think there is an actual problem in the optimizer.

Thanks for reporting this issue!

> ForwardedFields annotation is not generating optimised execution plan in 
> example KMeans job
> ---
>
> Key: FLINK-5888
> URL: https://issues.apache.org/jira/browse/FLINK-5888
> Project: Flink
>  Issue Type: Bug
>  Components: DataSet API, Examples, Java API
>Affects Versions: 1.1.3
>Reporter: Ziyad Muhammed Mohiyudheen
>
> Flink KMeans java example [1] shows the usage of ForwardedFields function 
> annotation. How ever, the example job was taking more time than expected on 
> medium sized data itself. By merely removing the function annotation from the 
> example code (with out any other change), a better execution plan and run 
> time was obtained. The execution plan shows that no combiner is used and the 
> two Map tasks are not chained when ForwardedFields is enabled. The experiment 
> is documented in [2]
> [1] 
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
> [2] https://drive.google.com/open?id=0B0IlZv0uHBuvVEZ5ZmNpN19jVVU



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5888) ForwardedFields annotation is not generating optimised execution plan in example KMeans job

2017-02-22 Thread Ziyad Muhammed Mohiyudheen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879605#comment-15879605
 ] 

Ziyad Muhammed Mohiyudheen commented on FLINK-5888:
---

[~fhueske], Here is the output for the cluster run:
https://drive.google.com/file/d/0B0IlZv0uHBuvUUlBUndld3BKc0E
The input data size is ~1.2G. The job finishes much faster when run without 
Annotations. Also, the statistics shows that huge amount of intermediate data 
is generated while run with Annotations enabled.


> ForwardedFields annotation is not generating optimised execution plan in 
> example KMeans job
> ---
>
> Key: FLINK-5888
> URL: https://issues.apache.org/jira/browse/FLINK-5888
> Project: Flink
>  Issue Type: Bug
>  Components: DataSet API, Examples, Java API
>Affects Versions: 1.1.3
>Reporter: Ziyad Muhammed Mohiyudheen
>
> Flink KMeans java example [1] shows the usage of ForwardedFields function 
> annotation. How ever, the example job was taking more time than expected on 
> medium sized data itself. By merely removing the function annotation from the 
> example code (with out any other change), a better execution plan and run 
> time was obtained. The execution plan shows that no combiner is used and the 
> two Map tasks are not chained when ForwardedFields is enabled. The experiment 
> is documented in [2]
> [1] 
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
> [2] https://drive.google.com/open?id=0B0IlZv0uHBuvVEZ5ZmNpN19jVVU



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5888) ForwardedFields annotation is not generating optimised execution plan in example KMeans job

2017-02-22 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879297#comment-15879297
 ] 

Fabian Hueske commented on FLINK-5888:
--

Yes, [~ggevay] is right. The hash partitioning is moved between the two maps 
(which is semantically OK due to the annotations).
Without debugging the optimizer, I see two reasons why that happened:

1. Both plans have identical unknown estimated costs because no stats are 
available.
2. The plan without Combiner has lower estimated costs (the combiner is not 
injected before the Reduce, because the data is already partitioned).

One more thing. Adding a {{CombineHint.HASH}} to the {{CentroidAccumulator}} as 
follows: 
{code}
.groupBy(0).reduce(new 
CentroidAccumulator()).setCombineHint(ReduceOperatorBase.CombineHint.HASH)
{code}

might speed up things as well.

> ForwardedFields annotation is not generating optimised execution plan in 
> example KMeans job
> ---
>
> Key: FLINK-5888
> URL: https://issues.apache.org/jira/browse/FLINK-5888
> Project: Flink
>  Issue Type: Bug
>  Components: DataSet API, Examples, Java API
>Affects Versions: 1.1.3
>Reporter: Ziyad Muhammed Mohiyudheen
>
> Flink KMeans java example [1] shows the usage of ForwardedFields function 
> annotation. How ever, the example job was taking more time than expected on 
> medium sized data itself. By merely removing the function annotation from the 
> example code (with out any other change), a better execution plan and run 
> time was obtained. The execution plan shows that no combiner is used and the 
> two Map tasks are not chained when ForwardedFields is enabled. The experiment 
> is documented in [2]
> [1] 
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
> [2] https://drive.google.com/open?id=0B0IlZv0uHBuvVEZ5ZmNpN19jVVU



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5888) ForwardedFields annotation is not generating optimised execution plan in example KMeans job

2017-02-22 Thread Gabor Gevay (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878779#comment-15878779
 ] 

Gabor Gevay commented on FLINK-5888:


Just a wild guess: Is it possible that the shuffle actually ended up between 
the two maps, and it is somehow not visible on the graph? I'm thinking this 
because I don't see how would the annotation make the shuffle disappear (but 
it's possible that I'm misunderstanding something).

> ForwardedFields annotation is not generating optimised execution plan in 
> example KMeans job
> ---
>
> Key: FLINK-5888
> URL: https://issues.apache.org/jira/browse/FLINK-5888
> Project: Flink
>  Issue Type: Bug
>  Components: DataSet API, Examples, Java API
>Affects Versions: 1.1.3
>Reporter: Ziyad Muhammed Mohiyudheen
>
> Flink KMeans java example [1] shows the usage of ForwardedFields function 
> annotation. How ever, the example job was taking more time than expected on 
> medium sized data itself. By merely removing the function annotation from the 
> example code (with out any other change), a better execution plan and run 
> time was obtained. The execution plan shows that no combiner is used and the 
> two Map tasks are not chained when ForwardedFields is enabled. The experiment 
> is documented in [2]
> [1] 
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
> [2] https://drive.google.com/open?id=0B0IlZv0uHBuvVEZ5ZmNpN19jVVU



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-5888) ForwardedFields annotation is not generating optimised execution plan in example KMeans job

2017-02-22 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878719#comment-15878719
 ] 

Fabian Hueske commented on FLINK-5888:
--

Thanks for reporting this issue.

The optimizer optimizes primarily for reduction of network traffic.
The plan with annotations does not add a combiner because it does not need to 
shuffle data (FORWARD instead of HASH PARTITION ship strategy) between 
CountAppender and CentroidAccumulator. 
This effect does not become visible because you run the program with a 
parallelism of 1. You might need to run it on a cluster to see an improvement.

I'm not sure though, why the two mapper are not chained in case of the program 
with annotations.

> ForwardedFields annotation is not generating optimised execution plan in 
> example KMeans job
> ---
>
> Key: FLINK-5888
> URL: https://issues.apache.org/jira/browse/FLINK-5888
> Project: Flink
>  Issue Type: Bug
>  Components: DataSet API, Examples, Java API
>Affects Versions: 1.1.3
>Reporter: Ziyad Muhammed Mohiyudheen
>
> Flink KMeans java example [1] shows the usage of ForwardedFields function 
> annotation. How ever, the example job was taking more time than expected on 
> medium sized data itself. By merely removing the function annotation from the 
> example code (with out any other change), a better execution plan and run 
> time was obtained. The execution plan shows that no combiner is used and the 
> two Map tasks are not chained when ForwardedFields is enabled. The experiment 
> is documented in [2]
> [1] 
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
> [2] https://drive.google.com/open?id=0B0IlZv0uHBuvVEZ5ZmNpN19jVVU



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)