[jira] [Commented] (FLINK-14607) SharedSlot cannot fulfill pending slot requests before it's completely released

2020-08-28 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186484#comment-17186484
 ] 

Zhu Zhu commented on FLINK-14607:
-

It would be not a problem. I will close this ticket.

> SharedSlot cannot fulfill pending slot requests before it's completely 
> released
> ---
>
> Key: FLINK-14607
> URL: https://issues.apache.org/jira/browse/FLINK-14607
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.1, 1.10.0
>Reporter: Zhu Zhu
>Priority: Minor
>
> Currently a pending request can only be fulfilled when a physical 
> slot({{AllocatedSlot}}) becomes available in {{SlotPool}}.
> A shared slot however, cannot be used to fulfill pending requests even if it 
> becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, running job A(parallelism=2) --(pipelined)--> B(parallelism=2) 
> with 1 slot only, all vertices are in the same slot sharing group, here's 
> what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot, A2's slot request is 
> pending because a slot cannot host 2 instances of the same JobVertex at the 
> same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status: 
> \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched 
> due to no physical slot becomes available, even though the shred slot is 
> qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. One a task slot is released, 
> its root {{MultiTaskSlot}} should be used to try fulfilling existing pending 
> task slots from other pending root slots({{unresolvedRootSlots}}) in this 
> {{SlotSharingManager}}(means in the same slot sharing group).
> We need to be careful to not cause any failures, and do not violate 
> colocation constraints.
> cc [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14607) SharedSlot cannot fulfill pending slot requests before it's completely released

2020-08-28 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186444#comment-17186444
 ] 

Till Rohrmann commented on FLINK-14607:
---

[~zhuzh] can this problem still occur with the pipelined region scheduler?

> SharedSlot cannot fulfill pending slot requests before it's completely 
> released
> ---
>
> Key: FLINK-14607
> URL: https://issues.apache.org/jira/browse/FLINK-14607
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.1, 1.10.0
>Reporter: Zhu Zhu
>Priority: Minor
>
> Currently a pending request can only be fulfilled when a physical 
> slot({{AllocatedSlot}}) becomes available in {{SlotPool}}.
> A shared slot however, cannot be used to fulfill pending requests even if it 
> becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, running job A(parallelism=2) --(pipelined)--> B(parallelism=2) 
> with 1 slot only, all vertices are in the same slot sharing group, here's 
> what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot, A2's slot request is 
> pending because a slot cannot host 2 instances of the same JobVertex at the 
> same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status: 
> \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched 
> due to no physical slot becomes available, even though the shred slot is 
> qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. One a task slot is released, 
> its root {{MultiTaskSlot}} should be used to try fulfilling existing pending 
> task slots from other pending root slots({{unresolvedRootSlots}}) in this 
> {{SlotSharingManager}}(means in the same slot sharing group).
> We need to be careful to not cause any failures, and do not violate 
> colocation constraints.
> cc [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14607) SharedSlot cannot fulfill pending slot requests before it's completely released

2019-11-05 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968042#comment-16968042
 ] 

Zhu Zhu commented on FLINK-14607:
-

Thanks for the explanation. The slot sharing can be simplified a lot in this 
way.
I have marked this issue to minor.

> SharedSlot cannot fulfill pending slot requests before it's completely 
> released
> ---
>
> Key: FLINK-14607
> URL: https://issues.apache.org/jira/browse/FLINK-14607
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0, 1.9.1
>Reporter: Zhu Zhu
>Priority: Minor
>
> Currently a pending request can only be fulfilled when a physical 
> slot({{AllocatedSlot}}) becomes available in {{SlotPool}}.
> A shared slot however, cannot be used to fulfill pending requests even if it 
> becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, running job A(parallelism=2) --(pipelined)--> B(parallelism=2) 
> with 1 slot only, all vertices are in the same slot sharing group, here's 
> what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot, A2's slot request is 
> pending because a slot cannot host 2 instances of the same JobVertex at the 
> same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status: 
> \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched 
> due to no physical slot becomes available, even though the shred slot is 
> qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. One a task slot is released, 
> its root {{MultiTaskSlot}} should be used to try fulfilling existing pending 
> task slots from other pending root slots({{unresolvedRootSlots}}) in this 
> {{SlotSharingManager}}(means in the same slot sharing group).
> We need to be careful to not cause any failures, and do not violate 
> colocation constraints.
> cc [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14607) SharedSlot cannot fulfill pending slot requests before it's completely released

2019-11-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967612#comment-16967612
 ] 

Till Rohrmann commented on FLINK-14607:
---

Yes exactly. Ideally, the {{ExecutionSlotAllocator}} receives all executions 
belonging to a single pipelined region and then makes sure that either all or 
none get assigned to a slot. Moreover, we only allow slot sharing between tasks 
belonging to the same pipelined region. That way we can statically compute the 
slot sharing at scheduling time and don't need to touch it later.

> SharedSlot cannot fulfill pending slot requests before it's completely 
> released
> ---
>
> Key: FLINK-14607
> URL: https://issues.apache.org/jira/browse/FLINK-14607
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0, 1.9.1
>Reporter: Zhu Zhu
>Priority: Major
>
> Currently a pending request can only be fulfilled when a physical 
> slot({{AllocatedSlot}}) becomes available in {{SlotPool}}.
> A shared slot however, cannot be used to fulfill pending requests even if it 
> becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, running job A(parallelism=2) --(pipelined)--> B(parallelism=2) 
> with 1 slot only, all vertices are in the same slot sharing group, here's 
> what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot, A2's slot request is 
> pending because a slot cannot host 2 instances of the same JobVertex at the 
> same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status: 
> \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched 
> due to no physical slot becomes available, even though the shred slot is 
> qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. One a task slot is released, 
> its root {{MultiTaskSlot}} should be used to try fulfilling existing pending 
> task slots from other pending root slots({{unresolvedRootSlots}}) in this 
> {{SlotSharingManager}}(means in the same slot sharing group).
> We need to be careful to not cause any failures, and do not violate 
> colocation constraints.
> cc [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14607) SharedSlot cannot fulfill pending slot requests before it's completely released

2019-11-05 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967449#comment-16967449
 ] 

Zhu Zhu commented on FLINK-14607:
-

With FLIP-53, it may happen for a *logical region* in a batch job, if it 
contains *PIPELINED* edges, runs with fewer slots than its parallelism. So it 
might be worse if a job has multiple such *logical regions*.
However, it's not a problem for streaming jobs, nor for batch jobs with all 
edges BLOCKING.

I also have concerns that it's not easy work and is possible to introduce 
severe bugs.
So I think it's fine to de-prioritize it if we have other ways to avoid it in 
near versions.

> SharedSlot cannot fulfill pending slot requests before it's completely 
> released
> ---
>
> Key: FLINK-14607
> URL: https://issues.apache.org/jira/browse/FLINK-14607
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0, 1.9.1
>Reporter: Zhu Zhu
>Priority: Major
>
> Currently a pending request can only be fulfilled when a physical 
> slot({{AllocatedSlot}}) becomes available in {{SlotPool}}.
> A shared slot however, cannot be used to fulfill pending requests even if it 
> becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, running job A(parallelism=2) --(pipelined)--> B(parallelism=2) 
> with 1 slot only, all vertices are in the same slot sharing group, here's 
> what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot, A2's slot request is 
> pending because a slot cannot host 2 instances of the same JobVertex at the 
> same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status: 
> \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched 
> due to no physical slot becomes available, even though the shred slot is 
> qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. One a task slot is released, 
> its root {{MultiTaskSlot}} should be used to try fulfilling existing pending 
> task slots from other pending root slots({{unresolvedRootSlots}}) in this 
> {{SlotSharingManager}}(means in the same slot sharing group).
> We need to be careful to not cause any failures, and do not violate 
> colocation constraints.
> cc [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14607) SharedSlot cannot fulfill pending slot requests before it's completely released

2019-11-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967431#comment-16967431
 ] 

Till Rohrmann commented on FLINK-14607:
---

Thanks for creating this issue [~zhuzh]. I think your analysis is correct and 
the described situation can happen. I think this is relevant when running batch 
jobs with fewer slots than parallelism.

I'm not sure how easy it is to fix this problem. In particular, as we don't 
have certain information like the co-location constraint in the 
{{SlotSharingManager}}.

In general, I'm not too happy with the current implementation of slot sharing 
which supports the lazy assignment of tasks to slot sharing groups. I think 
this overcomplicates things and I would much rather move in the direction of a 
pipelined region scheduler which assigns static slot sharing groups at 
scheduling time. Hence, I'm not sure whether we should invest much effort into 
the {{SlotSharingManager}} as it might be gone soonish.

> SharedSlot cannot fulfill pending slot requests before it's completely 
> released
> ---
>
> Key: FLINK-14607
> URL: https://issues.apache.org/jira/browse/FLINK-14607
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0, 1.9.1
>Reporter: Zhu Zhu
>Priority: Major
>
> Currently a pending request can only be fulfilled when a physical 
> slot({{AllocatedSlot}}) becomes available in {{SlotPool}}.
> A shared slot however, cannot be used to fulfill pending requests even if it 
> becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, running job A(parallelism=2) --(pipelined)--> B(parallelism=2) 
> with 1 slot only, all vertices are in the same slot sharing group, here's 
> what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot, A2's slot request is 
> pending because a slot cannot host 2 instances of the same JobVertex at the 
> same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status: 
> \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched 
> due to no physical slot becomes available, even though the shred slot is 
> qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. One a task slot is released, 
> its root {{MultiTaskSlot}} should be used to try fulfilling existing pending 
> task slots from other pending root slots({{unresolvedRootSlots}}) in this 
> {{SlotSharingManager}}(means in the same slot sharing group).
> We need to be careful to not cause any failures, and do not violate 
> colocation constraints.
> cc [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)