[
https://issues.apache.org/jira/browse/FLINK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219150#comment-17219150
]
Yuan Mei commented on FLINK-19774:
--
Here is an example from [~roman_khachatryan]:
# Downstream fails and reconnects; Netty Thread1 on the upstream enters {{createReadView}}.
# Netty Thread1 gets stuck (e.g. descheduled by the OS).
# Downstream fails again and reconnects; Netty Thread2 on the upstream enters {{createReadView}} and proceeds.
# Netty Thread2 sends some data.
# Netty Thread1 wakes up and continues in {{createReadView}}, releasing the view and corrupting the subpartition state.
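
The interleaving above can be replayed deterministically on a single thread. The sketch below is only an illustration under stated assumptions: {{View}}, {{Subpartition}}, {{releaseCurrent}} and {{install}} are hypothetical stand-ins, not Flink classes or methods, and {{createReadView}} is modelled as "release the current view, then install a new one" so that the stall of Thread1 can be placed between the steps.

```java
import java.util.ArrayList;
import java.util.List;

/** Replays the five steps on one thread so that the outcome is deterministic. */
class StaleCreateReadViewDemo {

    /** Hypothetical stand-in for a subpartition read view. */
    static final class View {
        final String createdBy;
        boolean released;

        View(String createdBy) {
            this.createdBy = createdBy;
        }
    }

    /** Hypothetical stand-in for a subpartition; createReadView is split into two calls. */
    static final class Subpartition {
        View current;

        // First half of the modelled createReadView: release whatever view is current.
        void releaseCurrent() {
            if (current != null) {
                current.released = true;
            }
        }

        // Second half: install a fresh view for the calling thread.
        View install(String createdBy) {
            current = new View(createdBy);
            return current;
        }
    }

    static List<String> replay() {
        List<String> log = new ArrayList<>();
        Subpartition sub = new Subpartition();
        sub.install("initial connection");

        // Steps 1-2: Thread1 enters createReadView and stalls before doing any work.
        // Steps 3-4: Thread2 enters createReadView, completes it, and sends data.
        sub.releaseCurrent();
        View view2 = sub.install("Thread2");
        log.add("Thread2 installed its view, released=" + view2.released);

        // Step 5: Thread1 wakes up and finishes createReadView, unaware that it is stale.
        sub.releaseCurrent();
        sub.install("Thread1");
        log.add("Thread2 view released=" + view2.released);
        log.add("subpartition now serves view created by " + sub.current.createdBy);
        return log;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

In this model the view that the live connection (Thread2) is using ends up released, and the subpartition points at a view created for a connection that no longer exists.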
> Introduce sub partition view version for approximate Failover
> -
>
> Key: FLINK-19774
> URL: https://issues.apache.org/jira/browse/FLINK-19774
> Project: Flink
> Issue Type: Sub-task
> Reporter: Yuan Mei
> Priority: Major
>
>
> This ticket addresses a corner case in which a downstream task fails several
> times in a row, or an orphaned task execution still exists for a short
> period after the new execution has started running (as described in the FLIP).
>
> Here is an idea of how to cleanly and thoroughly solve this kind of problem:
> # We go with the simplified view-release scheme: a view is released only
> right before a new one is created (in thread2). That is, we do not clean up
> the view when the downstream task disconnects ({{releaseView}} is not called
> from the reference copy of the view, in either thread1 or thread2).
> ** This greatly simplifies the threading model.
> ** This does not cause any resource leak, since releasing a view only serves
> to notify the upstream result partition to releaseOnConsumption once all
> subpartitions are consumed in PipelinedSubPartitionView. In our case, we do
> not release the result partition on consumption anyway (the result partition
> is tracked in the JobMaster, similar to the blocking ResultPartition type).
> # Each view is associated with a downstream task execution version.
> ** This makes sense because we now actually have different versions of the
> view, corresponding to the vertex.version of the downstream task.
> ** createView is performed only if the version to create is greater than
> the existing one.
> ** If we decide to create a new view, the old view's parent (subpartition)
> is set to invalid.
> I think this way we can completely disconnect the old view from the
> subpartition. In addition, the handler in use would always hold the
> freshest view reference.
>
> Point 1 has already been addressed in FLINK-19632. This ticket is to address
> Point 2.
> Detailed discussion in [https://github.com/apache/flink/pull/13648]
>
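
The version check described in Point 2 of the quoted description can be sketched as follows. This is a minimal illustration, not the change made in the pull request: the class names, the {{int}} version parameter, the {{Optional}} return type and the use of {{synchronized}} are all assumptions made for this example. It only shows the two rules stated above: a view is created only for a strictly greater version, and the replaced view loses its parent reference.

```java
import java.util.Optional;

class VersionedViewSketch {

    /** Hypothetical view that remembers the downstream execution version it was created for. */
    static final class View {
        final int version;
        private Subpartition parent; // set to null once the view has been invalidated

        View(int version, Subpartition parent) {
            this.version = version;
            this.parent = parent;
        }

        boolean isValid() {
            return parent != null;
        }

        void invalidate() {
            parent = null;
        }
    }

    /** Hypothetical subpartition that accepts only requests newer than its current view. */
    static final class Subpartition {
        private View current;

        synchronized Optional<View> createReadView(int requestedVersion) {
            // Reject stale or duplicate requests: the version must be strictly greater.
            if (current != null && requestedVersion <= current.version) {
                return Optional.empty();
            }
            // Disconnect the old view from the subpartition before replacing it.
            if (current != null) {
                current.invalidate();
            }
            current = new View(requestedVersion, this);
            return Optional.of(current);
        }
    }

    public static void main(String[] args) {
        Subpartition sub = new Subpartition();
        View v1 = sub.createReadView(1).get();
        View v2 = sub.createReadView(2).get();
        // A delayed thread that still carries version 1 is rejected and cannot touch v2.
        boolean staleAccepted = sub.createReadView(1).isPresent();
        System.out.println("v1 valid: " + v1.isValid());
        System.out.println("v2 valid: " + v2.isValid());
        System.out.println("stale request accepted: " + staleAccepted);
    }
}
```

With such a guard, a delayed Netty thread carrying an older version would get no view back, so it has nothing to release or overwrite.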