[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896835#comment-16896835 ]

Zhu Zhu commented on FLINK-13421:
---------------------------------

Hi [~till.rohrmann], the PR is open. The solution is a bit different from what was proposed. Only SlotSharingManager#listResolvedRootSlotInfo is changed. SlotSharingManager#getUnresolvedRootSlot and MultiTaskSlot#release are left unchanged, because changing them is either unnecessary or may cause other inconsistency issues.

> Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-13421
>                 URL: https://issues.apache.org/jira/browse/FLINK-13421
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>   Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>         Attachments: jm_concurrentmodification.log
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a TM is lost and the RM identifies it first, the RM notifies the JM about it through
> JobMaster#notifyAllocationFailure.
> We observed an unexpected ConcurrentModificationException in this process, stack as below:
>
> Caused by: java.util.ConcurrentModificationException
>     at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
>     at java.util.HashMap$ValueIterator.next(HashMap.java:1466)
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:477)
>     at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:712)
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:692)
>     at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:538)
>     at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:664)
>     ... 26 more
>
> This can cause an allocated slot to be removed from SlotPool#allocatedSlots while not all of
> its payload tasks get failed. Tasks may hang in the scheduled state forever, unable to fail
> or time out, as in the attached log.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
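The idea of changing only the listing of resolved root slots can be illustrated with a small sketch. This is not the code from the PR: ReleasingGuardSketch, RootSlot, listAvailableRootSlots, and allocate are hypothetical names, and the releasing flag only stands in for whatever state the real MultiTaskSlot tracks. The sketch assumes that a slot being released is simply skipped when candidates for a new allocation are listed, so a re-entrant allocation cannot add a child to the slot whose children are being iterated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Flink code: root slots hide themselves from allocation while releasing.
class ReleasingGuardSketch {

    static final class RootSlot {
        final String id;
        final Map<String, Runnable> children = new HashMap<>();
        boolean releasing;

        RootSlot(String id) {
            this.id = id;
        }

        void release() {
            releasing = true;
            // Iterating the live view is safe only while nothing adds children to this slot.
            for (Runnable onRelease : children.values()) {
                onRelease.run();
            }
            children.clear();
        }
    }

    private final List<RootSlot> rootSlots = new ArrayList<>();

    RootSlot createRootSlot(String id) {
        RootSlot slot = new RootSlot(id);
        rootSlots.add(slot);
        return slot;
    }

    // Only slots that are not currently being released are offered for allocation.
    List<RootSlot> listAvailableRootSlots() {
        List<RootSlot> available = new ArrayList<>();
        for (RootSlot slot : rootSlots) {
            if (!slot.releasing) {
                available.add(slot);
            }
        }
        return available;
    }

    // Places a task into the first available root slot; returns that slot's id, or null.
    String allocate(String taskId) {
        List<RootSlot> available = listAvailableRootSlots();
        if (available.isEmpty()) {
            return null;
        }
        RootSlot target = available.get(0);
        target.children.put(taskId, () -> { });
        return target.id;
    }

    // Releasing "failing" triggers a re-entrant allocation, which lands in "healthy".
    static String demo() {
        ReleasingGuardSketch manager = new ReleasingGuardSketch();
        RootSlot failing = manager.createRootSlot("failing");
        manager.createRootSlot("healthy");
        String[] placedIn = new String[1];
        failing.children.put("task-1", () -> placedIn[0] = manager.allocate("task-3"));
        failing.children.put("task-2", () -> { });
        failing.release();
        return placedIn[0];
    }
}
```

In this model the release loop itself stays untouched; the guard sits entirely on the listing side.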
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895929#comment-16895929 ]

Till Rohrmann commented on FLINK-13421:
---------------------------------------

Hi [~zhuzh], what's the state of this issue?
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894968#comment-16894968 ]

Zhu Zhu commented on FLINK-13421:
---------------------------------

Work in progress.
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893646#comment-16893646 ]

Till Rohrmann commented on FLINK-13421:
---------------------------------------

Thanks a lot for the investigation [~zhuzh]. I've assigned the ticket to you.
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893402#comment-16893402 ]

Zhu Zhu commented on FLINK-13421:
---------------------------------

We observed it when running a stability test on Flink 1.9. The log is attached.
It seems this is not a newly introduced issue, so maybe it is not a blocker for 1.9.
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893397#comment-16893397 ]

Till Rohrmann commented on FLINK-13421:
---------------------------------------

Did you observe this while testing Flink or was it a test which failed?
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893396#comment-16893396 ]

Till Rohrmann commented on FLINK-13421:
---------------------------------------

That is a good finding [~zhuzh]. I need to take a closer look to verify your solution proposal.
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892776#comment-16892776 ]

Zhu Zhu commented on FLINK-13421:
---------------------------------

I think I found the cause.

The ConcurrentModificationException happened in the children release loop in MultiTaskSlot#release. New slot allocations took place during that loop and added a new child to the MultiTaskSlot, which modified the children field.

The new slot allocations happened because releasing a child slot's payload caused a task failover. The failover canceled tasks and returned allocated slots to the SlotPool. The returned slots were then assigned to some tasks that had not been canceled yet. These slot assignments satisfied the input location preference constraints of some downstream tasks and triggered the slot allocation of those tasks.

To fix this, I think we can make a MultiTaskSlot unavailable for slot allocation (in SlotSharingManager#listResolvedRootSlotInfo and SlotSharingManager#getUnresolvedRootSlot) while it is in the releasingChildren state. Besides, it would be better to change MultiTaskSlot#release so that it does not iterate over its children field directly, which avoids a ConcurrentModificationException in any case.

Hi [~till.rohrmann], do you have any suggestions?
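The second suggestion above, releasing children without iterating over the children field directly, could look roughly like the following sketch. It is a simplification under stated assumptions rather than the actual MultiTaskSlot code: SnapshotReleaseSketch, addChild, release, and the released list are hypothetical names. The loop walks a copy of the map, so a callback that adds a child during the release no longer invalidates the iterator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simplification of a multi-task slot; names are illustrative only.
class SnapshotReleaseSketch {
    private final Map<String, Runnable> children = new HashMap<>();
    final List<String> released = new ArrayList<>();

    void addChild(String id, Runnable onRelease) {
        children.put(id, onRelease);
    }

    // Iterate over a copy so that release callbacks may modify 'children' safely.
    void release() {
        Map<String, Runnable> snapshot = new HashMap<>(children);
        for (Map.Entry<String, Runnable> child : snapshot.entrySet()) {
            released.add(child.getKey());
            child.getValue().run();
        }
        // Remove only the children that were part of the snapshot.
        children.keySet().removeAll(snapshot.keySet());
    }

    int childCount() {
        return children.size();
    }

    // Releasing "a" re-enters and adds a child named "late" while the loop is running.
    static SnapshotReleaseSketch demo() {
        SnapshotReleaseSketch slot = new SnapshotReleaseSketch();
        slot.addChild("a", () -> slot.addChild("late", () -> { }));
        slot.addChild("b", () -> { });
        slot.release();
        return slot;
    }
}
```

Note that in this sketch the child added during the loop is not visited by the release pass and is still registered afterwards. Copying avoids the exception, but on its own it does not stop a slot that is being released from accepting new children.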
[jira] [Commented] (FLINK-13421) Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
[ https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892573#comment-16892573 ]

Zhu Zhu commented on FLINK-13421:
---------------------------------

All job modifications happen in the main thread. Therefore, the ConcurrentModificationException seems to be caused by the MultiTaskSlot's children being changed during the MultiTaskSlot's release iteration loop.

> Unexpected ConcurrentModificationException when RM notifies JM about allocation failure
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-13421
>                 URL: https://issues.apache.org/jira/browse/FLINK-13421
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>   Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>         Attachments: jm_concurrentmodification.log
>
>
> When a TM is lost and the RM identifies it first, the RM notifies the JM about it through
> JobMaster#notifyAllocationFailure.
> We observed an unexpected ConcurrentModificationException in this process, stack as below:
>
> Caused by: java.util.ConcurrentModificationException
>     at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
>     at java.util.HashMap$ValueIterator.next(HashMap.java:1466)
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:477)
>     at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:712)
>     at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:692)
>     at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:538)
>     at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:664)
>     ... 26 more
>
> This can cause an allocated slot to be removed from SlotPool#allocatedSlots while not all of
> its payload tasks get failed. Tasks may hang in the scheduled state forever, unable to fail
> or time out, as in the attached log.
> It has not been figured out yet how a concurrent modification can happen. We do not have a
> debug log for it and have not been able to reproduce it with debug logging enabled.
> However, we can make SlotSharingManager$MultiTaskSlot not iterate over its children
> directly, so that a ConcurrentModificationException cannot occur in any case.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
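The point that this exception does not require a second thread can be shown with a minimal, self-contained sketch. It is not Flink code: ReentrantReleaseDemo, addChild, and releaseAll are hypothetical names, and the map only stands in for the children of a MultiTaskSlot. A release callback that adds an entry to the same HashMap while its values() view is being iterated is enough for the fail-fast iterator to throw.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a slot whose children live in a HashMap; not Flink code.
class ReentrantReleaseDemo {
    private final Map<String, Runnable> children = new HashMap<>();

    void addChild(String id, Runnable onRelease) {
        children.put(id, onRelease);
    }

    // Iterates the live values() view, as a naive release loop would.
    void releaseAll() {
        for (Runnable onRelease : children.values()) {
            onRelease.run();
        }
    }

    // Returns true if releasing fails with a ConcurrentModificationException.
    static boolean reproduces() {
        ReentrantReleaseDemo slot = new ReentrantReleaseDemo();
        // Releasing any of these children re-enters the slot and adds a new child.
        for (int i = 0; i < 3; i++) {
            slot.addChild("child-" + i, () -> slot.addChild("late", () -> { }));
        }
        try {
            slot.releaseAll();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME on a single thread: " + reproduces());
    }
}
```

Everything here runs on the calling thread; the structural modification happens between two calls to the iterator's next(), which is what the HashMap$HashIterator.nextNode frame in the stack trace checks for.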