[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-03 Thread Andrey Zagrebin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987055#comment-16987055
 ] 

Andrey Zagrebin commented on FLINK-15007:
-

Thanks for the explanation [~xintongsong], makes sense to me.
[~aljoscha] I think we can close this issue if you agree

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: DualInputWordCount.jar
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> [https://github.com/aljoscha/docker-hadoop-cluster] to bring up a YARN 
> cluster and then running a compiled Flink in there.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4]):
> {code:java}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the 
> class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] 
> Disable flaky yarn_kerberos_docker (default input) test
> git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check 
> managed memory fraction range when setting it into StreamConfig
> git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
> # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out 
> slots creation from the TaskSlotTable constructor
> git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change 
> Resource to use BigDecimal as its value
> git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
> # bad: [25f87ec208a642283e995811d809632129ca289a] 
> [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid 
> Gregorian cutover
> git bisect bad 25f87ec208a642283e995811d809632129ca289a
> # bad: [21c6b85a6f5991aabcbcd41fedc860d662d478fb] [FLINK-14842][table] add 
> logging for loaded modules and functions
> git bisect bad 21c6b85a6f5991aabcbcd41fedc860d662d478fb
> # bad: [48986aa0b89de731d1b9136b59d409933cc15408] [hotfix] Remove unnecessary 
> comments about memory size calculation before network init
> git bisect bad 48986aa0b89de731d1b9136b59d409933cc15408
> # bad: [4c4652efa43ed8ab456f5f63c89b57d8c4a621f8] [hotfix] Remove unused 
> number of slots in MemoryManager
> git bisect bad 4c4652efa43ed8ab456f5f63c89b57d8c4a621f8
> # bad: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the 
> scope of MemoryManager from TaskExecutor to slot
> git bisect bad 9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4
> # first bad commit: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] 
> Shrink the scope of MemoryManager from TaskExecutor to slot
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-03 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987046#comment-16987046
 ] 

Xintong Song commented on FLINK-15007:
--

The matching between pending slots and registered slots happens in 
{{SlotManagerImpl#registerSlot}}, {{findExactlyMatchingPendingTaskManagerSlot}} 
to be specific. If the resource profile does not match, 
{{findExactlyMatchingPendingTaskManagerSlot}} will return {{null}}, thus 
{{pendingTaskManagerSlot}} will also be {{null}}, and the codes will not 
executed to the else branch below, where {{pendingTaskManagerSlot}} is removed 
from {{pendingSlots}}.

If you look at 
{{ActiveResourceManagerFactory#createActiveResourceManagerConfiguration}} in 
commit {{4d28d3f783e48708ab1175b323aa9ed332cf207e}} (or other commits with this 
problem), you will find that managed memory size is explicitly set into 
configuration in bytes. This value is later used in 
{{ResourceManager#createWorkerSlotProfiles}} for creating profiles of pending 
slots.

On the TM side, the explicitly configured managed memory size is first read in 
{{TaskManagerServicesConfiguration#fromConfiguration}}, by 
{{ConfigurationParserUtils#getManagedMemorySize}} which parses the bytes value 
but only returns the megabytes value. Then in 
{{TaskManagerServices#calculateMemorySizeByType}}, this megabytes value is 
converted back to bytes, which is used for calculating slot profile on the TM 
side in {{TaskManagerServices#computeSlotResourceProfile}}.

The conversion from bytes into megabytes ({{>> 20}}), then to bytes again ({{<< 
20}}) leads to accuracy loss.

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: DualInputWordCount.jar
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> [https://github.com/aljoscha/docker-hadoop-cluster] to bring up a YARN 
> cluster and then running a compiled Flink in there.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4]):
> {code:java}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the 
> class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] 
> Disable flaky yarn_kerberos_docker (default input) test
> git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check 
> managed memory fraction range when setting it into StreamConfig
> git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
> # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out 
> slots creation from the TaskSlotTable constructor
> git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change 
> Resource to use BigDecimal as its value
> git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
> # bad: [25f87ec208a642283e995811d809632129ca289a] 
> [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid 
> Gregorian cutover
> git bisect bad 25f87ec208

[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-03 Thread Andrey Zagrebin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987021#comment-16987021
 ] 

Andrey Zagrebin commented on FLINK-15007:
-

I checked, the problem is solved after merging FLIP-49: 
[https://github.com/apache/flink/pull/10161]
It would be still nice to understand in detail where in source code the slot 
request was rejected for the newly launched TM
[~xintongsong] could you post references in source code where it happens? and 
maybe where the calculations diverged

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: DualInputWordCount.jar
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> [https://github.com/aljoscha/docker-hadoop-cluster] to bring up a YARN 
> cluster and then running a compiled Flink in there.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4]):
> {code:java}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the 
> class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] 
> Disable flaky yarn_kerberos_docker (default input) test
> git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check 
> managed memory fraction range when setting it into StreamConfig
> git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
> # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out 
> slots creation from the TaskSlotTable constructor
> git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change 
> Resource to use BigDecimal as its value
> git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
> # bad: [25f87ec208a642283e995811d809632129ca289a] 
> [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid 
> Gregorian cutover
> git bisect bad 25f87ec208a642283e995811d809632129ca289a
> # bad: [21c6b85a6f5991aabcbcd41fedc860d662d478fb] [FLINK-14842][table] add 
> logging for loaded modules and functions
> git bisect bad 21c6b85a6f5991aabcbcd41fedc860d662d478fb
> # bad: [48986aa0b89de731d1b9136b59d409933cc15408] [hotfix] Remove unnecessary 
> comments about memory size calculation before network init
> git bisect bad 48986aa0b89de731d1b9136b59d409933cc15408
> # bad: [4c4652efa43ed8ab456f5f63c89b57d8c4a621f8] [hotfix] Remove unused 
> number of slots in MemoryManager
> git bisect bad 4c4652efa43ed8ab456f5f63c89b57d8c4a621f8
> # bad: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the 
> scope of MemoryManager from TaskExecutor to slot
> git bisect bad 9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4
> # first bad commit: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] 
> Shrink the scope of MemoryManager from TaskExecutor to slot
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-03 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986770#comment-16986770
 ] 

Yang Wang commented on FLINK-15007:
---

I think [~xintongsong] is right. I am developing the active Flink Kubernetes 
Integration and come across the same problem. Set different 
`taskmanager.heap.size` may lead to different result. The root cause is 
precision loss in TaskManager when convert to bytes value to mega bytes value.

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: DualInputWordCount.jar
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> [https://github.com/aljoscha/docker-hadoop-cluster] to bring up a YARN 
> cluster and then running a compiled Flink in there.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4]):
> {code:java}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the 
> class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] 
> Disable flaky yarn_kerberos_docker (default input) test
> git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check 
> managed memory fraction range when setting it into StreamConfig
> git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
> # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out 
> slots creation from the TaskSlotTable constructor
> git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change 
> Resource to use BigDecimal as its value
> git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
> # bad: [25f87ec208a642283e995811d809632129ca289a] 
> [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid 
> Gregorian cutover
> git bisect bad 25f87ec208a642283e995811d809632129ca289a
> # bad: [21c6b85a6f5991aabcbcd41fedc860d662d478fb] [FLINK-14842][table] add 
> logging for loaded modules and functions
> git bisect bad 21c6b85a6f5991aabcbcd41fedc860d662d478fb
> # bad: [48986aa0b89de731d1b9136b59d409933cc15408] [hotfix] Remove unnecessary 
> comments about memory size calculation before network init
> git bisect bad 48986aa0b89de731d1b9136b59d409933cc15408
> # bad: [4c4652efa43ed8ab456f5f63c89b57d8c4a621f8] [hotfix] Remove unused 
> number of slots in MemoryManager
> git bisect bad 4c4652efa43ed8ab456f5f63c89b57d8c4a621f8
> # bad: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the 
> scope of MemoryManager from TaskExecutor to slot
> git bisect bad 9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4
> # first bad commit: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] 
> Shrink the scope of MemoryManager from TaskExecutor to slot
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-02 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986674#comment-16986674
 ] 

Xintong Song commented on FLINK-15007:
--

I think I find the cause of this issue.

I was looking at this [log|https://api.travis-ci.org/v3/job/614505046/log.txt] 
that [~gjy] reported in 
[FLINK-14834|https://issues.apache.org/jira/browse/FLINK-14834].

According to the log, RM first received on slot request, and started a TM for 
that request. After the first TM is registered, RM received another 2 slot 
requests, this time it started only 1 TM. We have seen similar problems before.

When RM starts a TM, it also generates pending task slots according to the 
number of slots the new TM should have (in our case 1), so that RM can assign 
slot requests to pending slots before the TM is registered. Later when TM 
registers to RM, the registered actual slots will be mapped to the pending 
slots if they have the same profile, and slot requests assigned to the pending 
slot will be assigned to the corresponding actual slot. If the profile of 
registered slot is different from the pending slot, the pending slot will not 
be consumed.

In our case, when the first request arrives, RM starts a new TM, generates a 
pending slot, and assign the request to the pending slot. When the TM is 
registered, the pending slot is not consumed because the slot profile doesn't 
match. Yet, the request will be assigned to the registered slot, and the 
pending slot becomes unassigned. Then the next 2 slot requests arrive, 1 of the 
request will be assigned to the pending slot, so RM only starts new TM for the 
other request. As a result, there's no pending TM and the slot request assigned 
to the pending slot will never be satisfied.

The reason that pending slot and registered slot may have different profile, is 
because memory sizes on RM / TM are calculated from configuration in different 
code paths, so the calculated results may suffer from slight rounding errors 
due to the fraction multiplying. Previously in 1.9, we solve this problem by 
explicitly overwriting managed memory size in the configuration to make sure RM 
/ TM have the same managed memory size, which is the only value in slot profile 
the is actually used. That solves most of the problem. But we failed to 
discover the problem in 1.9 that there is a bytes -> megabytes -> bytes 
conversion of managed memory size on TM side, which may also introduce errors. 
In 1.9 all the fields of ResourceProfile are represented by MB, so the 
conversion should not cause problems. But recently we changed them to 
MemorySize that stores the bytes value, which uncovers the accuracy loss.

I think this also explains [~aljoscha]'s finding that not all the value of 
`taskmanager.heap.size` triggers the problem. Because different value of 
`taskmanager.heap.size` can result in different managed memory size, and not 
all managed memory sizes have accuracy loss in the conversions.

The problem should be fixed by FLIP-49, which will be merged soon. In FLIP-49 
we calculated the memory sizes with exactly same code paths, to eliminate the 
differences between resource profiles calculated on RM / TM sides.

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: DualInputWordCount.jar
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> [https://github.com/aljoscha/docker-hadoop-cluster] to bring up a YARN 
> cluster and then running a compiled Flink in there.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|https://gi

[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-02 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986086#comment-16986086
 ] 

Zhu Zhu commented on FLINK-15007:
-

Each TM contains 3 slots, and one TM should be able to carry the job. So the 
issue may be that why the RM requested 2 new TMs when only 3 new slots are 
requested.
Another issue is that why the 2nd requested TM cannot be fulfilled, yet I'm not 
sure if it's an expected cluster limitation.

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: DualInputWordCount.jar
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> [https://github.com/aljoscha/docker-hadoop-cluster] to bring up a YARN 
> cluster and then running a compiled Flink in there.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code:java}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --output hdfs:///wc-out 
> && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4]):
> {code:java}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the 
> class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] 
> Disable flaky yarn_kerberos_docker (default input) test
> git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check 
> managed memory fraction range when setting it into StreamConfig
> git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
> # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out 
> slots creation from the TaskSlotTable constructor
> git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change 
> Resource to use BigDecimal as its value
> git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
> # bad: [25f87ec208a642283e995811d809632129ca289a] 
> [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid 
> Gregorian cutover
> git bisect bad 25f87ec208a642283e995811d809632129ca289a
> # bad: [21c6b85a6f5991aabcbcd41fedc860d662d478fb] [FLINK-14842][table] add 
> logging for loaded modules and functions
> git bisect bad 21c6b85a6f5991aabcbcd41fedc860d662d478fb
> # bad: [48986aa0b89de731d1b9136b59d409933cc15408] [hotfix] Remove unnecessary 
> comments about memory size calculation before network init
> git bisect bad 48986aa0b89de731d1b9136b59d409933cc15408
> # bad: [4c4652efa43ed8ab456f5f63c89b57d8c4a621f8] [hotfix] Remove unused 
> number of slots in MemoryManager
> git bisect bad 4c4652efa43ed8ab456f5f63c89b57d8c4a621f8
> # bad: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the 
> scope of MemoryManager from TaskExecutor to slot
> git bisect bad 9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4
> # first bad commit: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] 
> Shrink the scope of MemoryManager from TaskExecutor to slot
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-02 Thread Aljoscha Krettek (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985957#comment-16985957
 ] 

Aljoscha Krettek commented on FLINK-15007:
--

Forgot to attach the example. This is the streaming WordCount example that I 
modified to have two "inputs", similar to how the WordCount example is used in 
the YARN/kerberos/Docker test. Instead of using the input once we use it like 
this:
{code}
text = 
env.readTextFile(params.get("input")).union(env.readTextFile(params.get("input")));
{code}
to create two inputs from the same path.

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: DualInputWordCount.jar
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> https://github.com/aljoscha/docker-hadoop-cluster to bring up a YARN cluster 
> and then running a compiled Flink in there.
> When you run
> {code}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --input hdfs:///wc-in-2 
> --output hdfs:///wc-out && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --input hdfs:///wc-in-2 
> --output hdfs:///wc-out && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|http://example.com]):
>  
> {code}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the 
> class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] 
> Disable flaky yarn_kerberos_docker (default input) test
> git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check 
> managed memory fraction range when setting it into StreamConfig
> git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
> # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out 
> slots creation from the TaskSlotTable constructor
> git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change 
> Resource to use BigDecimal as its value
> git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
> # bad: [25f87ec208a642283e995811d809632129ca289a] 
> [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid 
> Gregorian cutover
> git bisect bad 25f87ec208a642283e995811d809632129ca289a
> # bad: [21c6b85a6f5991aabcbcd41fedc860d662d478fb] [FLINK-14842][table] add 
> logging for loaded modules and functions
> git bisect bad 21c6b85a6f5991aabcbcd41fedc860d662d478fb
> # bad: [48986aa0b89de731d1b9136b59d409933cc15408] [hotfix] Remove unnecessary 
> comments about memory size calculation before network init
> git bisect bad 48986aa0b89de731d1b9136b59d409933cc15408
> # bad: [4c4652efa43ed8ab456f5f63c89b57d8c4a621f8] [hotfix] Remove unused 
> number of slots in MemoryManager
> git bisect bad 4c4652efa43ed8ab456f5f63c89b57d8c4a621f8
> # bad: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the 
> scope of MemoryManager from TaskExecutor to slot
> git bisect bad 9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4
> # first bad commit: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] 
> Shrink the scope of MemoryManager from TaskExecutor to slot
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15007) Flink on YARN does not request required TaskExecutors in some cases

2019-12-02 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985920#comment-16985920
 ] 

Zili Chen commented on FLINK-15007:
---

CC [~fly_in_gis] do you have some ideas?

> Flink on YARN does not request required TaskExecutors in some cases
> ---
>
> Key: FLINK-15007
> URL: https://issues.apache.org/jira/browse/FLINK-15007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.10.0
>Reporter: Aljoscha Krettek
>Priority: Blocker
> Fix For: 1.10.0
>
>
> This was discovered while debugging FLINK-14834. In some cases Flink does not 
> request new {{TaskExecutors}} even though new slots are requested. You can 
> see this in some of the logs attached to FLINK-14834.
> You can reproduce this using 
> https://github.com/aljoscha/docker-hadoop-cluster to bring up a YARN cluster 
> and then running a compiled Flink in there.
> When you run
> {code}
> bin/flink run -m yarn-cluster -p 3 -yjm 1200 -ytm 1200 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --input hdfs:///wc-in-2 
> --output hdfs:///wc-out && hdfs dfs -rm -r /wc-out
> {code}
> the job waits and eventually fails because it does not have enough slots. 
> (You can see in the log that 3 new slots are requested but only 2 
> {{TaskExecutors}} are requested.
> When you run
> {code}
> bin/flink run -m yarn-cluster -p 3 -yjm 1224 -ytm 1224 
> /root/DualInputWordCount.jar --input hdfs:///wc-in-1 --input hdfs:///wc-in-2 
> --output hdfs:///wc-out && hdfs dfs -rm -r /wc-out
> {code}
> runs successfully.
> This is the {{git bisect}} log that identifies the first faulty commit 
> ([https://github.com/apache/flink/commit/9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4|http://example.com]):
>  
> {code}
> git bisect start
> # good: [09f2f43a1d73c76bf4d3f4a1205269eb860deb14] [FLINK-14154][ml] Add the 
> class for multivariate Gaussian Distribution.
> git bisect good 09f2f43a1d73c76bf4d3f4a1205269eb860deb14
> # bad: [85905f80e9711967711c2992612dccdd2cc211ac] [FLINK-14834][tests] 
> Disable flaky yarn_kerberos_docker (default input) test
> git bisect bad 85905f80e9711967711c2992612dccdd2cc211ac
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # good: [c9c8a29b1b2e4f2886fba1524432f9788b564e61] 
> [FLINK-14759][coordination] Remove unused class TaskManagerCliOptions
> git bisect good c9c8a29b1b2e4f2886fba1524432f9788b564e61
> # bad: [ae539c97c858b94e0e2504b54a8517ac1383482a] [hotfix][runtime] Check 
> managed memory fraction range when setting it into StreamConfig
> git bisect bad ae539c97c858b94e0e2504b54a8517ac1383482a
> # good: [01d6972ab267807b8afccb09a45c454fa76d6c4b] [hotfix] Refactor out 
> slots creation from the TaskSlotTable constructor
> git bisect good 01d6972ab267807b8afccb09a45c454fa76d6c4b
> # bad: [d32e1d00854e24bc4bb3aad6d866c2d709acd993] [FLINK-14594][core] Change 
> Resource to use BigDecimal as its value
> git bisect bad d32e1d00854e24bc4bb3aad6d866c2d709acd993
> # bad: [25f87ec208a642283e995811d809632129ca289a] 
> [FLINK-11935][table-planner] Fix cast timestamp/date to string to avoid 
> Gregorian cutover
> git bisect bad 25f87ec208a642283e995811d809632129ca289a
> # bad: [21c6b85a6f5991aabcbcd41fedc860d662d478fb] [FLINK-14842][table] add 
> logging for loaded modules and functions
> git bisect bad 21c6b85a6f5991aabcbcd41fedc860d662d478fb
> # bad: [48986aa0b89de731d1b9136b59d409933cc15408] [hotfix] Remove unnecessary 
> comments about memory size calculation before network init
> git bisect bad 48986aa0b89de731d1b9136b59d409933cc15408
> # bad: [4c4652efa43ed8ab456f5f63c89b57d8c4a621f8] [hotfix] Remove unused 
> number of slots in MemoryManager
> git bisect bad 4c4652efa43ed8ab456f5f63c89b57d8c4a621f8
> # bad: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] Shrink the 
> scope of MemoryManager from TaskExecutor to slot
> git bisect bad 9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4
> # first bad commit: [9fed0ddc5bc015f98246a2d8d9adbe5fb2b91ba4] [FLINK-14400] 
> Shrink the scope of MemoryManager from TaskExecutor to slot
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)