
[jira] [Commented] (FLINK-19774) Introduce sub partition view version for approximate Failover

2020-10-23 Thread Yuan Mei (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219524#comment-17219524
 ] 

Yuan Mei commented on FLINK-19774:
--

Places that need to be changed:

1. Mark the view's parent as invalid.

2. Release the view before it is set to null in the subpartition (done).
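
A minimal sketch of one reading of these two changes, in Java. All names here (ViewInvalidationSketch, Subpartition, View, replaceView, poll) are hypothetical stand-ins and not Flink's actual classes or signatures. It assumes "invalid" means the view drops its reference to the parent subpartition, and it omits locking for brevity.

```java
/** Hypothetical stand-ins for illustration; not Flink's actual classes. */
public class ViewInvalidationSketch {

    /** A subpartition that holds a reference to at most one read view. */
    static final class Subpartition {
        private View readView;
        private int buffersPolled;

        /** Replaces the current view: release first, then clear, then create. */
        View replaceView() {
            if (readView != null) {
                readView.release(); // change 2: release the view ...
                readView = null;    // ... before the reference is set to null
            }
            readView = new View(this);
            return readView;
        }

        boolean pollBuffer() {
            buffersPolled++;
            return true;
        }

        int getBuffersPolled() {
            return buffersPolled;
        }
    }

    /** A view that can no longer reach its parent once it has been released. */
    static final class View {
        private Subpartition parent; // null once the parent is marked invalid

        View(Subpartition parent) {
            this.parent = parent;
        }

        void release() {
            parent = null; // change 1: the parent of this view becomes invalid
        }

        boolean isValid() {
            return parent != null;
        }

        /** Returns false without touching the subpartition if the view is stale. */
        boolean poll() {
            return parent != null && parent.pollBuffer();
        }
    }

    public static void main(String[] args) {
        Subpartition subpartition = new Subpartition();
        View oldView = subpartition.replaceView();
        View newView = subpartition.replaceView();

        oldView.poll(); // stale view: no effect on the subpartition
        newView.poll(); // live view: reaches the subpartition

        System.out.println("old view valid: " + oldView.isValid());
        System.out.println("new view valid: " + newView.isValid());
        System.out.println("buffers polled: " + subpartition.getBuffersPolled());
    }
}
```

With the parent reference cleared, a handler that still holds the old view cannot modify the subpartition state through it.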

 

> Introduce sub partition view version for approximate Failover
> -
>
> Key: FLINK-19774
> URL: https://issues.apache.org/jira/browse/FLINK-19774
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yuan Mei
>Priority: Major
>
>  
> This ticket addresses a corner case where a downstream task fails several 
> times in a row, or where an orphan task execution may exist for a short 
> period of time after the new execution is running (as described in the FLIP).
>  
> Here is an idea of how to cleanly and thoroughly solve this kind of problem:
>  # We go with the simplified view-release approach: only release the view 
> before a new one is created (in thread2). That is, we won't clean up the view 
> when the downstream task disconnects ({{releaseView}} would not be called from 
> the reference copy of the view) (in thread1 or 2).
>  ** This would greatly simplify the threading model.
>  ** This won't cause any resource leak, since a view release only notifies 
> the upstream result partition to releaseOnConsumption when all subpartitions 
> are consumed in PipelinedSubPartitionView. In our case, we do not release the 
> result partition on consumption anyway (the result partition is tracked 
> in the JobMaster, similar to the blocking ResultPartition type).
>  # Each view is associated with a downstream task execution version.
>  ** This makes sense because we now actually have different versions of the 
> view, corresponding to the vertex.version of the downstream task.
>  ** createView is performed only if the new version to create is greater than 
> the existing one.
>  ** If we decide to create a new view, the old view should be released.
> I think this way we can completely disconnect the old view from the 
> subpartition. Besides that, the working handler in use would always hold the 
> freshest view reference.
>  
> Point 1 has already been addressed in FLINK-19632. This ticket is to address 
> Point 2.
> Detailed discussion in [https://github.com/apache/flink/pull/13648]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-19774) Introduce sub partition view version for approximate Failover

2020-10-22 Thread Yuan Mei (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219150#comment-17219150
 ] 

Yuan Mei commented on FLINK-19774:
--

Here is an example from [~roman_khachatryan]
 # downstream fails, reconnects, Netty Thread1 on upstream enters 
{{createReadView}}
 # Netty Thread1 gets stuck (e.g. unscheduled by OS)
 # downstream fails again, reconnects, Netty Thread2 on upstream enters 
{{createReadView}} and proceeds
 # Netty Thread2 sends some data
 # Netty Thread1 wakes up and continues in {{createReadView}}, releasing the 
view and corrupting subpartition state
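
The interleaving above can be replayed deterministically against a versioned createReadView, as proposed in the ticket description. This is only a sketch under stated assumptions: VersionedViewSketch, Subpartition, View and the int version parameter are hypothetical names, not Flink's actual API, and the two Netty threads are simulated by two sequential calls in the problematic order.

```java
/** Hypothetical stand-ins for illustration; not Flink's actual classes. */
public class VersionedViewSketch {

    /** A read view tagged with the downstream execution version that requested it. */
    static final class View {
        final int version;
        private boolean released;

        View(int version) {
            this.version = version;
        }

        void release() {
            released = true;
        }

        boolean isReleased() {
            return released;
        }
    }

    /** A subpartition that keeps at most one live view. */
    static final class Subpartition {
        private final Object lock = new Object();
        private View current; // guarded by lock

        /**
         * Creates a view for the requested version. Returns null if a view of
         * the same or a newer version already exists, i.e. the request is stale.
         */
        View createReadView(int requestedVersion) {
            synchronized (lock) {
                if (current != null && requestedVersion <= current.version) {
                    return null; // stale request: leave the live view untouched
                }
                if (current != null) {
                    current.release(); // release the old view before replacing it
                }
                current = new View(requestedVersion);
                return current;
            }
        }
    }

    public static void main(String[] args) {
        Subpartition subpartition = new Subpartition();

        // Thread2 (second reconnect, version 2) proceeds first.
        View fromThread2 = subpartition.createReadView(2);
        // Thread1 (first reconnect, version 1) wakes up late.
        View fromThread1 = subpartition.createReadView(1);

        System.out.println("thread2 view created: " + (fromThread2 != null));
        System.out.println("thread1 request rejected: " + (fromThread1 == null));
        System.out.println("thread2 view still live: " + !fromThread2.isReleased());
    }
}
```

Because the late request carries the older version, it is rejected instead of releasing the view that Thread2 is already using.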

> Introduce sub partition view version for approximate Failover
> -
>
> Key: FLINK-19774
> URL: https://issues.apache.org/jira/browse/FLINK-19774
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yuan Mei
>Priority: Major
>
>  
> This ticket addresses a corner case where a downstream task fails several 
> times in a row, or where an orphan task execution may exist for a short 
> period of time after the new execution is running (as described in the FLIP).
>  
> Here is an idea of how to cleanly and thoroughly solve this kind of problem:
>  # We go with the simplified view-release approach: only release the view 
> before a new one is created (in thread2). That is, we won't clean up the view 
> when the downstream task disconnects ({{releaseView}} would not be called from 
> the reference copy of the view) (in thread1 or 2).
>  ** This would greatly simplify the threading model.
>  ** This won't cause any resource leak, since a view release only notifies 
> the upstream result partition to releaseOnConsumption when all subpartitions 
> are consumed in PipelinedSubPartitionView. In our case, we do not release the 
> result partition on consumption anyway (the result partition is tracked 
> in the JobMaster, similar to the blocking ResultPartition type).
>  # Each view is associated with a downstream task execution version.
>  ** This makes sense because we now actually have different versions of the 
> view, corresponding to the vertex.version of the downstream task.
>  ** createView is performed only if the new version to create is greater than 
> the existing one.
>  ** If we decide to create a new view, the old view's parent (subpartition) 
> is set to invalid.
> I think this way we can completely disconnect the old view from the 
> subpartition. Besides that, the working handler in use would always hold the 
> freshest view reference.
>  
> Point 1 has already been addressed in FLINK-19632. This ticket is to address 
> Point 2.
> Detailed discussion in [https://github.com/apache/flink/pull/13648]
>  



