[jira] [Commented] (FLINK-10968) Implement TaskManager Entrypoint

2018-11-27 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700812#comment-16700812
 ] 

JIN SUN commented on FLINK-10968:
-

Yes, it will extend TaskManagerRunner and handle the parameters/arguments.
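As a rough illustration (not the final design), such an entrypoint could simply normalize the pod's arguments and delegate to the existing runner; the class name below is a placeholder assumption:

{code:java}
import org.apache.flink.runtime.taskexecutor.TaskManagerRunner;

/**
 * Hypothetical sketch only: a Kubernetes-specific main() that starts the
 * TaskManager inside a pod by reusing the generic TaskManagerRunner.
 */
public class KubernetesTaskManagerEntrypoint {

    public static void main(String[] args) throws Exception {
        // Pod-specific arguments (e.g. dynamic properties passed by the
        // ResourceManager that launched this pod) would be normalized here
        // before handing off to the generic runner.
        TaskManagerRunner.main(args);
    }
}
{code}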

> Implement TaskManager Entrypoint
> 
>
> Key: FLINK-10968
> URL: https://issues.apache.org/jira/browse/FLINK-10968
> Project: Flink
>  Issue Type: Sub-task
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> Implement the main() entrypoint that starts the TaskManager pod.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10968) Implement TaskManager Entrypoint

2018-11-21 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10968:
---

 Summary: Implement TaskManager Entrypoint
 Key: FLINK-10968
 URL: https://issues.apache.org/jira/browse/FLINK-10968
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN
Assignee: JIN SUN


Implement the main() entrypoint that starts the TaskManager pod.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10934) Implement KubernetesJobClusterEntrypoint

2018-11-19 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN reassigned FLINK-10934:
---

Assignee: JIN SUN

> Implement KubernetesJobClusterEntrypoint
> 
>
> Key: FLINK-10934
> URL: https://issues.apache.org/jira/browse/FLINK-10934
> Project: Flink
>  Issue Type: Sub-task
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> * Implement KubernetesJobClusterEntrypoint



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10939) Add documents for Flink on native k8s

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10939:
---

 Summary: Add documents for Flink on native k8s
 Key: FLINK-10939
 URL: https://issues.apache.org/jira/browse/FLINK-10939
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN
Assignee: JIN SUN






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10938) Enable Flink on native k8s E2E Tests in Travis CI

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10938:
---

 Summary: Enable Flink on native k8s E2E Tests in Travis CI 
 Key: FLINK-10938
 URL: https://issues.apache.org/jira/browse/FLINK-10938
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN
Assignee: JIN SUN


Add E2E tests to verify the Flink-on-Kubernetes integration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10937) Add scripts to create docker image for k8s

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10937:
---

 Summary: Add scripts to create docker image for k8s
 Key: FLINK-10937
 URL: https://issues.apache.org/jira/browse/FLINK-10937
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN
Assignee: JIN SUN


Add a script to build the Docker image for Flink on native Kubernetes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10936) Implement Command line tools

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10936:
---

 Summary: Implement Command line tools
 Key: FLINK-10936
 URL: https://issues.apache.org/jira/browse/FLINK-10936
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN
Assignee: JIN SUN


Implement command line tools to start Kubernetes sessions:
 * k8s-session.sh to start and stop a session, analogous to yarn-session.sh
 * a customized command line that is invoked by CliFrontend and ./bin/flink run to submit jobs to a Kubernetes cluster



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10935) Implement KubeClient with Fabric8 Kubernetes clients

2018-11-19 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN reassigned FLINK-10935:
---

Assignee: JIN SUN

> Implement KubeClient with Fabric8 Kubernetes clients
> 
>
> Key: FLINK-10935
> URL: https://issues.apache.org/jira/browse/FLINK-10935
> Project: Flink
>  Issue Type: Sub-task
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> Implement KubeClient with the Fabric8 Kubernetes client and add tests



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10935) Implement KubeClient with Fabric8 Kubernetes clients

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10935:
---

 Summary: Implement KubeClient with Fabric8 Kubernetes clients
 Key: FLINK-10935
 URL: https://issues.apache.org/jira/browse/FLINK-10935
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN


Implement KubeClient with the Fabric8 Kubernetes client and add tests.
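For illustration only, creating a pod with the Fabric8 client could look roughly like the following sketch; the pod name, image, and namespace are placeholder assumptions rather than the actual implementation:

{code:java}
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class Fabric8KubeClientSketch {

    public static void main(String[] args) {
        // The client picks up the kubeconfig / in-cluster service account automatically.
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            Pod taskManagerPod = new PodBuilder()
                .withNewMetadata()
                    .withName("flink-taskmanager-1")   // placeholder name
                .endMetadata()
                .withNewSpec()
                    .addNewContainer()
                        .withName("taskmanager")
                        .withImage("flink:latest")     // placeholder image
                        .withArgs("taskmanager")
                    .endContainer()
                .endSpec()
                .build();

            client.pods().inNamespace("default").create(taskManagerPod);
        }
    }
}
{code}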



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10934) Implement KubernetesJobClusterEntrypoint

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10934:
---

 Summary: Implement KubernetesJobClusterEntrypoint
 Key: FLINK-10934
 URL: https://issues.apache.org/jira/browse/FLINK-10934
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN


* Implement KubernetesJobClusterEntrypoint



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10933) Implement KubernetesSessionClusterEntrypoint

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10933:
---

 Summary: Implement KubernetesSessionClusterEntrypoint
 Key: FLINK-10933
 URL: https://issues.apache.org/jira/browse/FLINK-10933
 Project: Flink
  Issue Type: Sub-task
  Components: ResourceManager
Reporter: JIN SUN
Assignee: JIN SUN


* Implement KubernetesSessionClusterEntrypoint
 * Implement TaskManager Entrypoint



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-9495) Implement ResourceManager for Kubernetes

2018-11-19 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN reassigned FLINK-9495:
--

Assignee: JIN SUN

> Implement ResourceManager for Kubernetes
> 
>
> Key: FLINK-9495
> URL: https://issues.apache.org/jira/browse/FLINK-9495
> Project: Flink
>  Issue Type: Sub-task
>  Components: ResourceManager
>Affects Versions: 1.5.0
>Reporter: Elias Levy
>Assignee: JIN SUN
>Priority: Major
>
> I noticed there is no issue for developing a Kubernetes specific 
> ResourceManager under FLIP-6, so I am creating this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10932) Initial flink-kubernetes module with empty implementation

2018-11-19 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10932:
---

 Summary: Initial flink-kubernetes module with empty implementation
 Key: FLINK-10932
 URL: https://issues.apache.org/jira/browse/FLINK-10932
 Project: Flink
  Issue Type: Sub-task
  Components: ResourceManager
Reporter: JIN SUN
Assignee: JIN SUN


Create the initial skeleton module to start the native Kubernetes integration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-9955) Kubernetes ClusterDescriptor

2018-11-19 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN reassigned FLINK-9955:
--

Assignee: JIN SUN

> Kubernetes ClusterDescriptor
> 
>
> Key: FLINK-9955
> URL: https://issues.apache.org/jira/browse/FLINK-9955
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client
>Reporter: Till Rohrmann
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.8.0
>
>
> In order to programmatically start a Flink cluster on Kubernetes, we need a 
> {{KubernetesClusterDescriptor}} implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10644) Batch Job: Speculative execution

2018-11-06 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10644:

Description: 
Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They impact job latency, since a straggler is typically on the critical path of the job and becomes the bottleneck.

Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JobManager to predict the runtime.

To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM: the slow task is not stopped, but a new copy runs in parallel, and the remaining copies are killed once one of them completes.

This JIRA is an umbrella to apply this kind of idea in Flink. Details will be appended later.

  was:
Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They impact job latency, since a straggler is typically on the critical path of the job and becomes the bottleneck.

Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JobManager to predict the runtime.

To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM: the slow task is not stopped, but a new copy runs in parallel, and the remaining copies are killed once one of them completes.

This JIRA is an umbrella to apply this kind of idea in Flink. Details will be appended later.

 

The design document is here: 
[https://docs.google.com/document/d/1X_Pfo4WcO-TEZmmVTTYNn44LQg5gnFeeaeqM7ZNLQ7M/edit]
 


> Batch Job: Speculative execution
> 
>
> Key: FLINK-10644
> URL: https://issues.apache.org/jira/browse/FLINK-10644
> Project: Flink
>  Issue Type: New Feature
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.8.0
>
>
> Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They impact job latency, since a straggler is typically on the critical path of the job and becomes the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JobManager to predict the runtime.
> To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM: the slow task is not stopped, but a new copy runs in parallel, and the remaining copies are killed once one of them completes.
> This JIRA is an umbrella to apply this kind of idea in Flink. Details will be appended later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10644) Batch Job: Speculative execution

2018-11-06 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10644:

Description: 
Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They impact job latency, since a straggler is typically on the critical path of the job and becomes the bottleneck.

Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JobManager to predict the runtime.

To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM: the slow task is not stopped, but a new copy runs in parallel, and the remaining copies are killed once one of them completes.

This JIRA is an umbrella to apply this kind of idea in Flink. Details will be appended later.

 

The design document is here: 
[https://docs.google.com/document/d/1X_Pfo4WcO-TEZmmVTTYNn44LQg5gnFeeaeqM7ZNLQ7M/edit]
 

  was:
Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They impact job latency, since a straggler is typically on the critical path of the job and becomes the bottleneck.

Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JobManager to predict the runtime.

To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM: the slow task is not stopped, but a new copy runs in parallel, and the remaining copies are killed once one of them completes.

This JIRA is an umbrella to apply this kind of idea in Flink. Details will be appended later.


> Batch Job: Speculative execution
> 
>
> Key: FLINK-10644
> URL: https://issues.apache.org/jira/browse/FLINK-10644
> Project: Flink
>  Issue Type: New Feature
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.8.0
>
>
> Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They impact job latency, since a straggler is typically on the critical path of the job and becomes the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JobManager to predict the runtime.
> To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM: the slow task is not stopped, but a new copy runs in parallel, and the remaining copies are killed once one of them completes.
> This JIRA is an umbrella to apply this kind of idea in Flink. Details will be appended later.
>  
> The design document is here: 
> [https://docs.google.com/document/d/1X_Pfo4WcO-TEZmmVTTYNn44LQg5gnFeeaeqM7ZNLQ7M/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9953) Active Kubernetes integration

2018-10-30 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-9953:
---
Description: 
This is the umbrella issue tracking Flink's active Kubernetes integration. 
Active means in this context that the {{ResourceManager}} can talk to 
Kubernetes to launch new pods similar to Flink's Yarn and Mesos integration.

 

Document can be found here: 
[https://docs.google.com/document/d/1Zmhui_29VASPcBOEqyMWnF3L6WEWZ4kedrCqya0WaAk/edit?usp=sharing]
 

  was:This is the umbrella issue tracking Flink's active Kubernetes 
integration. Active means in this context that the {{ResourceManager}} can talk 
to Kubernetes to launch new pods similar to Flink's Yarn and Mesos integration.


> Active Kubernetes integration
> -
>
> Key: FLINK-9953
> URL: https://issues.apache.org/jira/browse/FLINK-9953
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination, ResourceManager
>Reporter: Till Rohrmann
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.7.0
>
>
> This is the umbrella issue tracking Flink's active Kubernetes integration. 
> Active means in this context that the {{ResourceManager}} can talk to 
> Kubernetes to launch new pods similar to Flink's Yarn and Mesos integration.
>  
> Document can be found here: 
> [https://docs.google.com/document/d/1Zmhui_29VASPcBOEqyMWnF3L6WEWZ4kedrCqya0WaAk/edit?usp=sharing]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10657) TPCHQuery3 fail with IllegalAccessException

2018-10-23 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10657:

Issue Type: Bug  (was: Improvement)

> TPCHQuery3 fail with IllegalAccessException
> ---
>
> Key: FLINK-10657
> URL: https://issues.apache.org/jira/browse/FLINK-10657
> Project: Flink
>  Issue Type: Bug
>  Components: E2E Tests
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Trivial
>  Labels: easy-fix, pull-request-available
> Fix For: 1.7.0
>
>
> Similar to [FLINK-7998|https://issues.apache.org/jira/browse/FLINK-7998], the 
> ShippingPriorityItem class in the TPCHQuery3.java example is declared private. This 
> causes an IllegalAccessException due to the reflection check during dynamic class 
> instantiation. Making it public resolves the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10657) TPCHQuery3 fail with IllegalAccessException

2018-10-23 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10657:
---

 Summary: TPCHQuery3 fail with IllegalAccessException
 Key: FLINK-10657
 URL: https://issues.apache.org/jira/browse/FLINK-10657
 Project: Flink
  Issue Type: Improvement
  Components: E2E Tests
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.7.0


Similar to [FLINK-7998|https://issues.apache.org/jira/browse/FLINK-7998], the 
ShippingPriorityItem class in the TPCHQuery3.java example is declared private. This causes 
an IllegalAccessException due to the reflection check during dynamic class instantiation. 
Making it public resolves the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10656) Refactor org.apache.flink.runtime.io.network.api.reader.ReaderBase

2018-10-23 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10656:
---

 Summary: Refactor 
org.apache.flink.runtime.io.network.api.reader.ReaderBase
 Key: FLINK-10656
 URL: https://issues.apache.org/jira/browse/FLINK-10656
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.8.0


The interface org.apache.flink.runtime.io.network.api.reader.ReaderBase is not very clean: its methods are only called for iterations and event handling, which is not related to the name ReaderBase. Since the two pieces of functionality are independent, the proposal is to rename it and split the interface into two isolated interfaces.

For more details please look at the PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10644) Batch Job: Speculative execution

2018-10-22 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10644:
---

 Summary: Batch Job: Speculative execution
 Key: FLINK-10644
 URL: https://issues.apache.org/jira/browse/FLINK-10644
 Project: Flink
  Issue Type: New Feature
  Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.8.0


Stragglers/outliers are tasks that run slower than most of the other tasks in a batch job. They impact job latency, since a straggler is typically on the critical path of the job and becomes the bottleneck.

Tasks may be slow for various reasons, including hardware degradation, software misconfiguration, or noisy neighbors. It is hard for the JobManager to predict the runtime.

To reduce the overhead of stragglers, other systems such as Hadoop/Tez and Spark have *_speculative execution_*. Speculative execution is a health-check procedure that looks for tasks to speculate, i.e. tasks running slower in an ExecutionJobVertex than the median of all successfully completed tasks in that EJV. Such slow tasks are re-submitted to another TM: the slow task is not stopped, but a new copy runs in parallel, and the remaining copies are killed once one of them completes.

This JIRA is an umbrella to apply this kind of idea in Flink. Details will be appended later.
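To make the detection step concrete, here is a small, self-contained sketch; the 1.5x-median threshold and all names are illustrative assumptions, not part of this proposal:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch: flag a running task as a straggler when its running time
 *  exceeds a multiple of the median runtime of the finished tasks in the same EJV. */
public class StragglerDetector {

    private static final double SPECULATION_MULTIPLIER = 1.5; // assumed threshold

    /** Median of the completed task runtimes in the ExecutionJobVertex. */
    static double median(List<Long> runtimesMillis) {
        List<Long> sorted = new ArrayList<>(runtimesMillis);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
            ? sorted.get(n / 2)
            : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    /** True if the task has run long enough that a speculative copy should be started. */
    static boolean isStraggler(long runningTimeMillis, List<Long> finishedRuntimesMillis) {
        if (finishedRuntimesMillis.isEmpty()) {
            return false; // nothing to compare against yet
        }
        return runningTimeMillis > SPECULATION_MULTIPLIER * median(finishedRuntimesMillis);
    }
}
{code}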



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10643) Bubble execution: Resource aware job execution

2018-10-22 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10643:
---

 Summary: Bubble execution: Resource aware job execution
 Key: FLINK-10643
 URL: https://issues.apache.org/jira/browse/FLINK-10643
 Project: Flink
  Issue Type: New Feature
  Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.8.0
 Attachments: image-2018-10-22-16-28-32-355.png

Today Flink supports various channel types, such as pipelined channels and blocking channels. A blocking channel indicates that the data is persisted as a batch and can be consumed later; it also indicates that the downstream task cannot start processing data until its producer has finished, and that the downstream task then depends only on this intermediate partition instead of on the upstream tasks.

By leveraging this characteristic, Flink already supports fine-grained failover, which builds failover regions to reduce the failover cost. However, we can leverage this characteristic even more. As described in this [paper|http://www.vldb.org/pvldb/vol11/p746-yin.pdf] (VLDB 2018), *_Bubble Execution_* not only uses this characteristic to implement fine-grained failover, but also uses it to balance resource utilization and job performance. As shown in the paper (and in the following chart), with 50% of the resources it sees only a 25% average slowdown (0.75 speedup) on the TPC-H benchmark.

!image-2018-10-22-16-28-32-355.png!

This JIRA is an umbrella that tries to apply the idea of this paper to Flink.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10573) Support task revocation

2018-10-22 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659782#comment-16659782
 ] 

JIN SUN commented on FLINK-10573:
-

Thanks Zhijiang, I would like to use this exception. I see the Jira is still open; do you have any update or patch?

> Support task revocation
> ---
>
> Key: FLINK-10573
> URL: https://issues.apache.org/jira/browse/FLINK-10573
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.7.0
>
>
> In batch mode, a partition-missing failure of a downstream task indicates that the output of the upstream task has been lost. To make the job succeed we need to rerun the upstream task to reproduce the data, which we call task revocation (revoking the success of the upstream task).
> For revocation, we need to identify the partition-missing issue, and it is better to detect the missing partition accurately:
>  * Ideally, it makes things much easier if we get a specific exception indicating that the data source is missing.
>  * When a task gets an IOException, it does not necessarily mean the source data has issues; it might also be a problem on the consumer side, such as a network issue of the consuming task.
>  * If multiple tasks cannot read the same source, it is highly likely the source data is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10576) Introduce Machine/Node/TM health management

2018-10-16 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10576:
---

 Summary: Introduce Machine/Node/TM health management
 Key: FLINK-10576
 URL: https://issues.apache.org/jira/browse/FLINK-10576
 Project: Flink
  Issue Type: Improvement
  Components: ResourceManager
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.8.0


When a task fails we can identify whether it was due to environment issues. In particular, when multiple tasks report environment errors from the same TM/machine/node, there is a high probability that this TaskManager has a problem, and if we find that multiple tasks become slow on a certain node, we should put the machine on probation (a rough sketch of the bookkeeping follows below):
 * avoid scheduling new tasks to it
 * release the TaskManager when all its tasks are drained, and allocate a new one if needed
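The probation bookkeeping could look roughly like this; the threshold, identifiers, and class name are assumptions for illustration, not the proposed implementation:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch only: count environment-error reports per TaskManager and put a TM on
 *  probation once enough reports have accumulated for it. */
public class TaskManagerProbationTracker {

    private static final int ENVIRONMENT_ERROR_THRESHOLD = 3; // assumed threshold

    private final Map<String, Integer> environmentErrorsPerTm = new HashMap<>();
    private final Set<String> tmsOnProbation = new HashSet<>();

    /** Called when a task reports an environment error from the given TaskManager. */
    public void reportEnvironmentError(String taskManagerId) {
        int count = environmentErrorsPerTm.merge(taskManagerId, 1, Integer::sum);
        if (count >= ENVIRONMENT_ERROR_THRESHOLD) {
            tmsOnProbation.add(taskManagerId);
        }
    }

    /** The scheduler would consult this before placing a new task. */
    public boolean isSchedulable(String taskManagerId) {
        return !tmsOnProbation.contains(taskManagerId);
    }
}
{code}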



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10575) Remove deprecated ExecutionGraphBuilder.buildGraph method

2018-10-16 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652542#comment-16652542
 ] 

JIN SUN commented on FLINK-10575:
-

[~till.rohrmann] does this sound reasonable? 

> Remove deprecated ExecutionGraphBuilder.buildGraph method
> -
>
> Key: FLINK-10575
> URL: https://issues.apache.org/jira/browse/FLINK-10575
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 1.7.0
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> ExecutionGraphBuilder is not a public API, so we should be able to remove 
> deprecated methods such as:
> @Deprecated
> public static ExecutionGraph buildGraph



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10575) Remove deprecated ExecutionGraphBuilder.buildGraph method

2018-10-16 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10575:
---

 Summary: Remove deprecated ExecutionGraphBuilder.buildGraph method
 Key: FLINK-10575
 URL: https://issues.apache.org/jira/browse/FLINK-10575
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager
Affects Versions: 1.7.0
Reporter: JIN SUN
Assignee: JIN SUN


ExecutionGraphBuilder is not a public API, so we should be able to remove deprecated methods such as:

@Deprecated
public static ExecutionGraph buildGraph



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10574) Update documents of new batch job failover strategy

2018-10-16 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10574:
---

 Summary: Update documents of new batch job failover strategy
 Key: FLINK-10574
 URL: https://issues.apache.org/jira/browse/FLINK-10574
 Project: Flink
  Issue Type: Sub-task
Reporter: JIN SUN


The new failover strategy exposes new configuration, so we need to document it. There are also some existing documents about fault tolerance, such as 
[here|https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/batch/fault_tolerance.html],
 that we need to update. I prefer to do this after the feature is completed, since we don’t want to expose it anywhere if the feature is not ready.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10574) Update documents of new batch job failover strategy

2018-10-16 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN reassigned FLINK-10574:
---

Assignee: JIN SUN

> Update documents of new batch job failover strategy
> ---
>
> Key: FLINK-10574
> URL: https://issues.apache.org/jira/browse/FLINK-10574
> Project: Flink
>  Issue Type: Sub-task
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> The new failover strategy exposes new configuration, so we need to document it. There are also some existing documents about fault tolerance, such as 
> [here|https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/batch/fault_tolerance.html],
>  that we need to update. I prefer to do this after the feature is completed, since we don’t want to expose it anywhere if the feature is not ready.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10573) Support task revocation

2018-10-16 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10573:
---

 Summary: Support task revocation
 Key: FLINK-10573
 URL: https://issues.apache.org/jira/browse/FLINK-10573
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.7.0


In batch mode, a partition-missing failure of a downstream task indicates that the output of the upstream task has been lost. To make the job succeed we need to rerun the upstream task to reproduce the data, which we call task revocation (revoking the success of the upstream task).

For revocation, we need to identify the partition-missing issue, and it is better to detect the missing partition accurately (a small sketch follows after this list):
 * Ideally, it makes things much easier if we get a specific exception indicating that the data source is missing.
 * When a task gets an IOException, it does not necessarily mean the source data has issues; it might also be a problem on the consumer side, such as a network issue of the consuming task.
 * If multiple tasks cannot read the same source, it is highly likely the source data is missing.
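A minimal illustrative sketch of the last point, treating a partition as lost only after several distinct consumer tasks failed to read it; the threshold and all names are assumptions, not the proposed design:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch only: a single consumer-side IOException is ambiguous, but reports from
 *  several distinct consumers strongly suggest the partition data itself is missing. */
public class PartitionLossDetector {

    private static final int DISTINCT_CONSUMER_THRESHOLD = 2; // assumed threshold

    private final Map<String, Set<String>> failedConsumersPerPartition = new HashMap<>();

    /** Returns true once enough distinct consumers reported read failures for the partition. */
    public boolean reportReadFailure(String partitionId, String consumerTaskId) {
        Set<String> consumers =
            failedConsumersPerPartition.computeIfAbsent(partitionId, k -> new HashSet<>());
        consumers.add(consumerTaskId);
        return consumers.size() >= DISTINCT_CONSUMER_THRESHOLD;
    }
}
{code}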



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10572) Enable Per-job level failover strategy.

2018-10-16 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10572:
---

 Summary: Enable Per-job level failover strategy. 
 Key: FLINK-10572
 URL: https://issues.apache.org/jira/browse/FLINK-10572
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.7.0


Today we can specify the ExecutionMode in the ExecutionConfig, which is a per-job setting. However, the FailoverStrategy is a cluster-level configuration, while it should be per-job:
 * The FailoverStrategy has dependencies on the ExecutionMode in the ExecutionConfig; for example, the pipelined ExecutionMode is not compatible with RestartIndividualStrategy, so configuring it at the cluster level does not make sense.
 * The FailoverStrategy also has dependencies on the RestartStrategy. For example, in the new batch failover strategy, instead of restarting indefinitely we want to fail the job if a certain condition is met; as a result, NoRestart or some new restart strategy should be configured.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10534) Add idle timeout for a flink session cluster

2018-10-11 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16647372#comment-16647372
 ] 

JIN SUN commented on FLINK-10534:
-

A session can have expiry-time semantics. We could probably add a parameter to specify the expiry time when creating the session (maybe on the command line); the default value can be infinite to keep the current behavior.
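A rough sketch of what the idle-timeout check itself could look like; the class name, polling approach, and null-as-infinite convention are assumptions for illustration:

{code:java}
import java.time.Duration;

/** Sketch only: shut the session cluster down once it has had no running jobs
 *  for longer than the configured idle timeout; a null timeout means "never". */
public class SessionIdleTimeoutCheck {

    private final Duration idleTimeout;   // e.g. parsed from a command-line option
    private long lastTimeWithRunningJobs = System.currentTimeMillis();

    public SessionIdleTimeoutCheck(Duration idleTimeout) {
        this.idleTimeout = idleTimeout;
    }

    /** Called periodically with the current number of running jobs. */
    public boolean shouldShutDown(int runningJobs) {
        long now = System.currentTimeMillis();
        if (runningJobs > 0) {
            lastTimeWithRunningJobs = now;
            return false;
        }
        // A null timeout models the proposed default: keep today's behavior and never expire.
        return idleTimeout != null && now - lastTimeWithRunningJobs > idleTimeout.toMillis();
    }
}
{code}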

 

> Add idle timeout for a flink session cluster
> 
>
> Key: FLINK-10534
> URL: https://issues.apache.org/jira/browse/FLINK-10534
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Affects Versions: 1.7.0
>Reporter: ouyangzhe
>Assignee: vinoyang
>Priority: Major
> Attachments: 屏幕快照 2018-10-12 上午10.24.08.png
>
>
> A Flink session cluster on YARN keeps running even when it has no jobs running at all, occupying YARN resources for no use.
> TaskManagers are released after an idle timeout, but the JobManager keeps running.
> I propose to add a configuration for an idle timeout for the JobManager too: if no job is running after the specified timeout, the Flink cluster finishes itself automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-10494) Rename 'JobManager' to 'JobMaster' for some classes in JobMaster folder

2018-10-11 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN closed FLINK-10494.
---
Resolution: Won't Fix

Closed as suggested.

> Rename 'JobManager' to 'JobMaster' for some classes in JobMaster folder
> ---
>
> Key: FLINK-10494
> URL: https://issues.apache.org/jira/browse/FLINK-10494
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 1.6.0, 1.7.0
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> Some names in the "flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster" folder are 
> confusing; we should rename them to JobMaster*:
>  * JobManagerRunner -> JobMasterRunner
>  * JobManagerGateway -> JobMasterGateway
>  * JobManagerSharedServices -> JobMasterSharedServices
>  * JobManagerException -> JobMasterException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9953) Active Kubernetes integration

2018-10-09 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643757#comment-16643757
 ] 

JIN SUN commented on FLINK-9953:


Thanks Till.

> Active Kubernetes integration
> -
>
> Key: FLINK-9953
> URL: https://issues.apache.org/jira/browse/FLINK-9953
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination, ResourceManager
>Reporter: Till Rohrmann
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.7.0
>
>
> This is the umbrella issue tracking Flink's active Kubernetes integration. 
> Active means in this context that the {{ResourceManager}} can talk to 
> Kubernetes to launch new pods similar to Flink's Yarn and Mesos integration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10404) Declarative resource management

2018-10-08 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642640#comment-16642640
 ] 

JIN SUN commented on FLINK-10404:
-

Glad to see the design document, will take a look and comment.

> Declarative resource management
> ---
>
> Key: FLINK-10404
> URL: https://issues.apache.org/jira/browse/FLINK-10404
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This is the umbrella issue to track the progress on implementing declarative 
> resource management.
> Declarative resource management is a change to Flink's current slot 
> allocation protocol. Instead of letting the {{JobMaster}} ask for each slot 
> individually, it will tell the {{ResourceManager}} its current need (min, 
> target, max) of slots. Based on that, the {{ResourceManager}} will assign 
> available slots or start new {{TaskExecutors}} to fulfill the request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10498) Decouple ExecutionGraph from JobMaster

2018-10-08 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642635#comment-16642635
 ] 

JIN SUN commented on FLINK-10498:
-

This seems related to https://issues.apache.org/jira/browse/FLINK-10429?

> Decouple ExecutionGraph from JobMaster
> --
>
> Key: FLINK-10498
> URL: https://issues.apache.org/jira/browse/FLINK-10498
> Project: Flink
>  Issue Type: Sub-task
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> With declarative resource management we want to react to the set of available 
> resources. Thus, we need a component which is responsible for scaling the 
> {{ExecutionGraph}} accordingly. In order to better do this and separate 
> concerns, it is beneficial to introduce a {{Scheduler/ExecutionGraphDriver}} 
> component which is in charge of the {{ExecutionGraph}}. This component owns 
> the {{ExecutionGraph}} and is allowed to modify it. In the first version, 
> this component will simply accommodate all the existing logic of the 
> {{JobMaster}} and the respective {{JobMaster}} methods are forwarded to this 
> component. 
> This new component should not change the existing behaviour of Flink.
> Later this component will be in charge of announcing the required resources, 
> deciding when to rescale and executing the rescaling operation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10392) Remove legacy mode

2018-10-04 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638905#comment-16638905
 ] 

JIN SUN commented on FLINK-10392:
-

Sounds good, I will try my best. :)

> Remove legacy mode
> --
>
> Key: FLINK-10392
> URL: https://issues.apache.org/jira/browse/FLINK-10392
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This issue is the umbrella issue to remove the legacy mode code from Flink.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10494) Rename 'JobManager' to 'JobMaster' for some classes in JobMaster folder

2018-10-04 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10494:

Summary: Rename 'JobManager' to 'JobMaster' for some classes in JobMaster 
folder  (was: Rename 'JobManager' to J'obMaster' for some classes in JobMaster 
folder)

> Rename 'JobManager' to 'JobMaster' for some classes in JobMaster folder
> ---
>
> Key: FLINK-10494
> URL: https://issues.apache.org/jira/browse/FLINK-10494
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 1.6.0, 1.7.0
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Minor
> Fix For: 1.7.0
>
>
> Some names in the "flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster" folder are 
> confusing; we should rename them to JobMaster*:
>  * JobManagerRunner -> JobMasterRunner
>  * JobManagerGateway -> JobMasterGateway
>  * JobManagerSharedServices -> JobMasterSharedServices
>  * JobManagerException -> JobMasterException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10494) Rename 'JobManager' to J'obMaster' for some classes in JobMaster folder

2018-10-04 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10494:
---

 Summary: Rename 'JobManager' to J'obMaster' for some classes in 
JobMaster folder
 Key: FLINK-10494
 URL: https://issues.apache.org/jira/browse/FLINK-10494
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager
Affects Versions: 1.6.0, 1.7.0
Reporter: JIN SUN
Assignee: JIN SUN
 Fix For: 1.7.0


Some names in "flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster" 
folder are confusing, we should rename it to JobMaster.

 
 * JobManagerRunner -> JobMasterRunner
 * JobManagerGateway -> JobMasterGateway
 * JobManagerSharedServices -> JobMasterSharedServices
 * JobManagerException -> JobMasterException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10392) Remove legacy mode

2018-10-04 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638539#comment-16638539
 ] 

JIN SUN commented on FLINK-10392:
-

+1, can we also do the refactoring by moving the code to a "legacy" folder, to avoid confusing new contributors?

> Remove legacy mode
> --
>
> Key: FLINK-10392
> URL: https://issues.apache.org/jira/browse/FLINK-10392
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This issue is the umbrella issue to remove the legacy mode code from Flink.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-10-02 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10205:

Affects Version/s: 1.7.0

> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 1.6.1, 1.7.0, 1.6.2
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSourceTask pulls InputSplits from the JobManager to achieve better 
> performance. However, when a DataSourceTask fails and reruns, it will not get 
> the same splits as its previous attempt, which can introduce inconsistent 
> results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  Document:
> [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
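As an illustration of the proposed fix (not the actual Flink code), a sketch of per-vertex split bookkeeping; the class name and the Queue stand-in for the global split assigner are assumptions:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

import org.apache.flink.core.io.InputSplit;

/** Sketch only: remember the splits handed to an ExecutionVertex so that a
 *  restarted attempt replays exactly the same sequence instead of pulling
 *  fresh splits from the global assigner. */
public class PerVertexInputSplits {

    private final List<InputSplit> assignedSplits = new ArrayList<>();
    private int nextIndex;

    /** A new execution attempt must see the same split sequence from the beginning. */
    public void resetForNewAttempt() {
        nextIndex = 0;
    }

    /** Returns the next split for this vertex, recording it for later attempts. */
    public InputSplit getNextSplit(Queue<InputSplit> globalAssigner) {
        if (nextIndex < assignedSplits.size()) {
            return assignedSplits.get(nextIndex++);  // replay a previously assigned split
        }
        InputSplit fresh = globalAssigner.poll();    // first attempt: take a new split
        if (fresh != null) {
            assignedSplits.add(fresh);
            nextIndex++;
        }
        return fresh;
    }
}
{code}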



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-10-02 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10205:

Fix Version/s: 1.7.0

> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 1.6.1, 1.7.0, 1.6.2
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSourceTask pulls InputSplits from the JobManager to achieve better 
> performance. However, when a DataSourceTask fails and reruns, it will not get 
> the same splits as its previous attempt, which can introduce inconsistent 
> results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  Document:
> [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10481) Wordcount end-to-end test in docker env unstable

2018-10-02 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636387#comment-16636387
 ] 

JIN SUN commented on FLINK-10481:
-

Seems like an intermittent issue, probably a Docker repository problem; other versions have the same problem.

> Wordcount end-to-end test in docker env unstable
> 
>
> Key: FLINK-10481
> URL: https://issues.apache.org/jira/browse/FLINK-10481
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0
>
>
> The {{Wordcount end-to-end test in docker env}} fails sometimes on Travis 
> with the following problem:
> {code}
> Status: Downloaded newer image for java:8-jre-alpine
>  ---> fdc893b19a14
> Step 2/16 : RUN apk add --no-cache bash snappy
>  ---> [Warning] IPv4 forwarding is disabled. Networking will not work.
>  ---> Running in 4329ebcd8a77
> fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
> WARNING: Ignoring 
> http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz: 
> temporary error (try again later)
> fetch 
> http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
> WARNING: Ignoring 
> http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz: 
> temporary error (try again later)
> ERROR: unsatisfiable constraints:
>   bash (missing):
> required by: world[bash]
>   snappy (missing):
> required by: world[snappy]
> The command '/bin/sh -c apk add --no-cache bash snappy' returned a non-zero 
> code: 2
> {code}
> https://api.travis-ci.org/v3/job/434909395/log.txt
> It seems as if it is related to 
> https://github.com/gliderlabs/docker-alpine/issues/264 and 
> https://github.com/gliderlabs/docker-alpine/issues/279.
> We might want to switch to a different base image to avoid these problems in 
> the future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-9953) Active Kubernetes integration

2018-10-02 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN reassigned FLINK-9953:
--

Assignee: JIN SUN

> Active Kubernetes integration
> -
>
> Key: FLINK-9953
> URL: https://issues.apache.org/jira/browse/FLINK-9953
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination, ResourceManager
>Reporter: Till Rohrmann
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.7.0
>
>
> This is the umbrella issue tracking Flink's active Kubernetes integration. 
> Active means in this context that the {{ResourceManager}} can talk to 
> Kubernetes to launch new pods similar to Flink's Yarn and Mesos integration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9953) Active Kubernetes integration

2018-09-28 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632653#comment-16632653
 ] 

JIN SUN commented on FLINK-9953:


[~till.rohrmann], is anyone working on this? I would like to volunteer.

> Active Kubernetes integration
> -
>
> Key: FLINK-9953
> URL: https://issues.apache.org/jira/browse/FLINK-9953
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination, ResourceManager
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This is the umbrella issue tracking Flink's active Kubernetes integration. 
> Active means in this context that the {{ResourceManager}} can talk to 
> Kubernetes to launch new pods similar to Flink's Yarn and Mesos integration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9955) Kubernetes ClusterDescriptor

2018-09-28 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632652#comment-16632652
 ] 

JIN SUN commented on FLINK-9955:


[~till.rohrmann], is anyone working on this? I would like to volunteer.

> Kubernetes ClusterDescriptor
> 
>
> Key: FLINK-9955
> URL: https://issues.apache.org/jira/browse/FLINK-9955
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> In order to programmatically start a Flink cluster on Kubernetes, we need a 
> {{KubernetesClusterDescriptor}} implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-09-28 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10205:

Description: 
Today the DataSourceTask pulls InputSplits from the JobManager to achieve better performance. However, when a DataSourceTask fails and reruns, it will not get the same splits as its previous attempt, which can introduce inconsistent results or even data corruption.

Furthermore, if two executions run at the same time (in the batch scenario), these two executions should process the same splits.

We need to fix this to make the inputs of a DataSourceTask deterministic. The proposal is to save all splits in the ExecutionVertex and have the DataSourceTask pull splits from there.

 Document:

[https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]

  was:
Today the DataSourceTask pulls InputSplits from the JobManager to achieve better performance. However, when a DataSourceTask fails and reruns, it will not get the same splits as its previous attempt, which can introduce inconsistent results or even data corruption.

Furthermore, if two executions run at the same time (in the batch scenario), these two executions should process the same splits.

We need to fix this to make the inputs of a DataSourceTask deterministic. The proposal is to save all splits in the ExecutionVertex and have the DataSourceTask pull splits from there.

 


> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 1.6.1, 1.6.2
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSourceTask pulls InputSplits from the JobManager to achieve better 
> performance. However, when a DataSourceTask fails and reruns, it will not get 
> the same splits as its previous attempt, which can introduce inconsistent 
> results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  Document:
> [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10429) Redesign Flink Scheduling, introducing dedicated Scheduler component

2018-09-26 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629571#comment-16629571
 ] 

JIN SUN commented on FLINK-10429:
-

+1 

I like the proposal and have put some comments in the document.

> Redesign Flink Scheduling, introducing dedicated Scheduler component
> 
>
> Key: FLINK-10429
> URL: https://issues.apache.org/jira/browse/FLINK-10429
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
>
> This epic tracks the redesign of scheduling in Flink. Scheduling is currently 
> a concern that is scattered across different components, mainly the 
> ExecutionGraph/Execution and the SlotPool. Scheduling also happens only on 
> the granularity of individual tasks, which make holistic scheduling 
> strategies hard to implement. In this epic we aim to introduce a dedicated 
> Scheduler component that can support use-case like auto-scaling, 
> local-recovery, and resource optimized batch.
> The design for this feature is developed here: 
> https://docs.google.com/document/d/1q7NOqt05HIN-PlKEEPB36JiuU1Iu9fnxxVGJzylhsxU/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10288) Failover Strategy improvement

2018-09-26 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10288:

Fix Version/s: 1.7.0

> Failover Strategy improvement
> -
>
> Key: FLINK-10288
> URL: https://issues.apache.org/jira/browse/FLINK-10288
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.7.0
>
>
> Flink puts significant effort into making streaming jobs fault tolerant. The checkpoint mechanism and exactly-once semantics make Flink different from other systems. However, there are still some cases that are not handled very well. These cases apply to both streaming and batch scenarios, and they are orthogonal to the current fault tolerance mechanism. Here is a summary of those cases:
>  # Some failures are non-recoverable, for example a user error such as a DivideByZeroException. We shouldn't try to restart the task, as it will never succeed. The DivideByZeroException is just a simple case; such errors are sometimes hard to reproduce or predict, as they might only be triggered by specific input data, so we shouldn't retry for all user-code exceptions.
>  # There is no limit on task retries today: unless a SuppressRestartException is encountered, a task keeps retrying until it succeeds. As mentioned above, we shouldn't retry at all in some cases, and for the exceptions we can retry, such as a network exception, should we have a retry limit? We need to retry for transient issues, but we also need a limit to avoid infinite retries and wasted resources. For batch and streaming workloads we might need different strategies.
>  # Some exceptions are due to hardware issues, such as disk/network malfunction. When a task/TaskManager fails because of this, we had better detect it and avoid scheduling to that machine next time.
>  # If a task reads from a blocking result partition and its input is not available, we can 'revoke' the producer task, mark it as failed, and rerun the upstream task to regenerate the data. The revocation can propagate up through the chain. In Spark, revocation is naturally supported by lineage.
> To make fault tolerance easier, we need to keep behavior as deterministic as possible. For user code this is not easy to control, but for system-related code we can fix it. For example, we should at least make sure that different attempts of the same task get the same inputs (we have a bug in the current codebase (DataSourceTask) that cannot guarantee this). Note that this is tracked by [FLINK-10205].
> Details are in this proposal:
> [https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10298) Batch Job Failover Strategy

2018-09-26 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10298:

Fix Version/s: 1.7.0

> Batch Job Failover Strategy
> ---
>
> Key: FLINK-10298
> URL: https://issues.apache.org/jira/browse/FLINK-10298
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
> Fix For: 1.7.0
>
>
> The new failover strategy needs to handle failures according to their failure type. It orchestrates all the logic we mentioned in this 
> [document|https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit];
> we can put the logic in the onTaskFailure method of the FailoverStrategy 
> interface, with the steps outlined inline:
> {code:java}
> public void onTaskFailure(Execution taskExecution, Throwable cause) {
>     //1. Get the throwable type
>     //2. If the type is NonRecoverable, fail the job
>     //3. If the type is PartitionDataMissingError, do revocation
>     //4. If the type is EnvironmentError, check the blacklist
>     //5. Other failure types are recoverable, but we need to remember the failure count
>     //6. If it exceeds the threshold, fail the job
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10289) Classify Exceptions to different category for apply different failover strategy

2018-09-26 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10289:

Fix Version/s: 1.7.0

> Classify Exceptions to different category for apply different failover 
> strategy
> ---
>
> Key: FLINK-10289
> URL: https://issues.apache.org/jira/browse/FLINK-10289
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> We need to classify exceptions and treat them with different strategies. To 
> do this, we propose to introduce the following throwable types and the 
> corresponding exceptions (a rough classifier sketch follows the list):
>  * NonRecoverable
>  ** We shouldn’t retry if an exception is classified as NonRecoverable.
>  ** For example, NoResourceAvailableException is a NonRecoverable exception.
>  ** Introduce a new exception UserCodeException to wrap all exceptions thrown 
> from user code.
>  * PartitionDataMissingError
>  ** In certain scenarios, producer data is transferred in blocking mode or 
> saved in a persistent store. If the partition is missing, we need to 
> revoke/rerun the producer task to regenerate the data.
>  ** Introduce a new exception PartitionDataMissingException to wrap all such 
> issues.
>  * EnvironmentError
>  ** These happen due to hardware or software issues tied to a specific 
> environment. The assumption is that a task will succeed if we run it in a 
> different environment, while other tasks running in the bad environment will 
> very likely fail. If multiple tasks fail on the same machine due to 
> EnvironmentError, we should consider adding the bad machine to a blacklist 
> and avoid scheduling tasks on it.
>  ** Introduce a new exception EnvironmentException to wrap all such issues.
>  * Recoverable
>  ** We assume all other issues are recoverable.
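> As a rough illustration, here is a minimal, self-contained sketch of a 
> classifier that walks the cause chain and maps exceptions to the categories 
> above. The enum and the wrapper exception names follow this proposal; the 
> classifier class itself is illustrative, not the final Flink API.
> {code:java}
> public class ThrowableClassifierSketch {
> 
>     public enum ThrowableType { NON_RECOVERABLE, PARTITION_DATA_MISSING_ERROR, ENVIRONMENT_ERROR, RECOVERABLE }
> 
>     // Walk the cause chain and return the first matching category.
>     public static ThrowableType classify(Throwable t) {
>         for (Throwable cur = t; cur != null; cur = cur.getCause()) {
>             String name = cur.getClass().getSimpleName();
>             if (name.equals("PartitionDataMissingException")) {
>                 return ThrowableType.PARTITION_DATA_MISSING_ERROR;
>             }
>             if (name.equals("EnvironmentException")) {
>                 return ThrowableType.ENVIRONMENT_ERROR;
>             }
>             if (name.equals("UserCodeException") || name.equals("NoResourceAvailableException")) {
>                 return ThrowableType.NON_RECOVERABLE;
>             }
>         }
>         return ThrowableType.RECOVERABLE; // everything else is assumed recoverable
>     }
> }{code}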



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-09-26 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10205:

Affects Version/s: 1.6.2
   1.6.1

> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 1.6.1, 1.6.2
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSource task pulls InputSplits from the JobManager to achieve 
> better performance. However, when a DataSourceTask fails and is rerun, it 
> will not get the same splits as its previous attempt; this can introduce 
> inconsistent results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all assigned splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  
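> As a minimal sketch of this idea (the class and method names are illustrative, 
> not the actual Execution/ExecutionVertex code): remember the splits handed to 
> a vertex, so that a restarted attempt replays exactly the same sequence.
> {code:java}
> import java.util.ArrayList;
> import java.util.Iterator;
> import java.util.List;
> 
> public class SplitAssignmentSketch<S> {
> 
>     private final List<S> assignedSplits = new ArrayList<>(); // splits already handed out
>     private final Iterator<S> freshSplits;                    // splits from the input split assigner
>     private int replayIndex = 0;                              // position for the current attempt
> 
>     public SplitAssignmentSketch(Iterator<S> freshSplits) {
>         this.freshSplits = freshSplits;
>     }
> 
>     // Called when the current attempt asks for its next split.
>     public S nextSplit() {
>         if (replayIndex < assignedSplits.size()) {
>             return assignedSplits.get(replayIndex++); // replay: same split as the previous attempt
>         }
>         if (freshSplits.hasNext()) {
>             S split = freshSplits.next();
>             assignedSplits.add(split);
>             replayIndex++;
>             return split;
>         }
>         return null; // no more splits
>     }
> 
>     // Called when the task fails and a new attempt will start.
>     public void resetForNewAttempt() {
>         replayIndex = 0;
>     }
> }{code}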



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10298) Batch Job Failover Strategy

2018-09-21 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624217#comment-16624217
 ] 

JIN SUN commented on FLINK-10298:
-

Hi Tison,

Thanks for pointing this out. Yes, I know there is a DataConsumptionException, 
and we can treat DataConsumptionException as a PartitionDataMissingError; it is 
one of the issues we need to handle. This JIRA targets a framework to improve 
failover, especially in the batch job scenario.

Jin

> Batch Job Failover Strategy
> ---
>
> Key: FLINK-10298
> URL: https://issues.apache.org/jira/browse/FLINK-10298
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> The new failover strategy needs to handle failures according to their failure 
> type. It orchestrates all the logic we described in this 
> [document|https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit].
> We can put the logic in the onTaskFailure method of the FailoverStrategy 
> interface, outlined inline:
> {code:java}
> public void onTaskFailure(Execution taskExecution, Throwable cause) {
>     //1. Get the throwable type
>     //2. If the type is NonRecoverable, fail the job
>     //3. If the type is PartitionDataMissingError, do revocation
>     //4. If the type is EnvironmentError, check/update the blacklist
>     //5. Other failure types are recoverable, but we need to remember the
>     //   count of failures for the task
>     //6. If the count exceeds the threshold, fail the job
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-09-20 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622387#comment-16622387
 ] 

JIN SUN commented on FLINK-10205:
-

Hi Stephan,

I have the FLIP and the pull request ready for this issue. Could you take a 
look, or ask somebody to have a look?

https://github.com/apache/flink/pull/6684 


The FLIP is here:

https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing
 


Jin



> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSource task pulls InputSplits from the JobManager to achieve 
> better performance. However, when a DataSourceTask fails and is rerun, it 
> will not get the same splits as its previous attempt; this can introduce 
> inconsistent results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all assigned splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10298) Batch Job Failover Strategy

2018-09-07 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10298:

Description: 
The new failover strategy needs to handle failures according to their failure 
type. It orchestrates all the logic we described in this 
[document|https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit].
We can put the logic in the onTaskFailure method of the FailoverStrategy 
interface, outlined inline:
{code:java}
public void onTaskFailure(Execution taskExecution, Throwable cause) {
    //1. Get the throwable type
    //2. If the type is NonRecoverable, fail the job
    //3. If the type is PartitionDataMissingError, do revocation
    //4. If the type is EnvironmentError, check/update the blacklist
    //5. Other failure types are recoverable, but we need to remember the
    //   count of failures for the task
    //6. If the count exceeds the threshold, fail the job
}{code}

  was:
The new failover strategy needs to consider handling failures according to 
different failure types. It orchestrates all the logics we mentioned in this 
[document|https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit#],
 we can put the logic in onTaskFailure method of the FailoverStrategy 
interface, with the logic inline:
{code:java}
public void onTaskFailure(Execution taskExecution, Throwable cause) {  

    //1. Get the throwable type

    //2. If the type is NonrecoverableType fail the job

    //3. If the type is PatritionDataMissingError, do revocation

    //4. If the type is EnvironmentError, do check blacklist

//5. Other failure types are recoverable, but we need to remember the count of 
the failure,

if it exceeds the threshold, fail the job

}{code}


> Batch Job Failover Strategy
> ---
>
> Key: FLINK-10298
> URL: https://issues.apache.org/jira/browse/FLINK-10298
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> The new failover strategy needs to handle failures according to their failure 
> type. It orchestrates all the logic we described in this 
> [document|https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit].
> We can put the logic in the onTaskFailure method of the FailoverStrategy 
> interface, outlined inline:
> {code:java}
> public void onTaskFailure(Execution taskExecution, Throwable cause) {
>     //1. Get the throwable type
>     //2. If the type is NonRecoverable, fail the job
>     //3. If the type is PartitionDataMissingError, do revocation
>     //4. If the type is EnvironmentError, check/update the blacklist
>     //5. Other failure types are recoverable, but we need to remember the
>     //   count of failures for the task
>     //6. If the count exceeds the threshold, fail the job
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10298) Batch Job Failover Strategy

2018-09-07 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10298:
---

 Summary: Batch Job Failover Strategy
 Key: FLINK-10298
 URL: https://issues.apache.org/jira/browse/FLINK-10298
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN


The new failover strategy needs to handle failures according to their failure 
type. It orchestrates all the logic we described in this 
[document|https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit#].
We can put the logic in the onTaskFailure method of the FailoverStrategy 
interface, outlined inline:
{code:java}
public void onTaskFailure(Execution taskExecution, Throwable cause) {
    //1. Get the throwable type
    //2. If the type is NonRecoverable, fail the job
    //3. If the type is PartitionDataMissingError, do revocation
    //4. If the type is EnvironmentError, check/update the blacklist
    //5. Other failure types are recoverable, but we need to remember the
    //   count of failures for the task
    //6. If the count exceeds the threshold, fail the job
}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10289) Classify Exceptions to different category for apply different failover strategy

2018-09-07 Thread JIN SUN (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JIN SUN updated FLINK-10289:

Description: 
We need to classify exceptions and treat them with different strategies. To do 
this, we propose to introduce the following throwable types and the 
corresponding exceptions:
 * NonRecoverable
 ** We shouldn’t retry if an exception is classified as NonRecoverable.
 ** For example, NoResourceAvailableException is a NonRecoverable exception.
 ** Introduce a new exception UserCodeException to wrap all exceptions thrown 
from user code.

 * PartitionDataMissingError
 ** In certain scenarios, producer data is transferred in blocking mode or 
saved in a persistent store. If the partition is missing, we need to 
revoke/rerun the producer task to regenerate the data.
 ** Introduce a new exception PartitionDataMissingException to wrap all such 
issues.

 * EnvironmentError
 ** These happen due to hardware or software issues tied to a specific 
environment. The assumption is that a task will succeed if we run it in a 
different environment, while other tasks running in the bad environment will 
very likely fail. If multiple tasks fail on the same machine due to 
EnvironmentError, we should consider adding the bad machine to a blacklist and 
avoid scheduling tasks on it.
 ** Introduce a new exception EnvironmentException to wrap all such issues.

 * Recoverable
 ** We assume all other issues are recoverable.

  was:
We need to classify exceptions and treat them with different strategies. To do 
this, we propose to introduce the following Throwable Types, and the 
corresponding exceptions:
 * NonRecoverable
 * We shouldn’t retry if an exception was classified as NonRecoverable
 * For example, NoResouceAvailiableException is a NonRecoverable Exception
 * Introduce a new Exception UserCodeException to wrap all exceptions that 
throw from user code


 *  PartitionDataMissingError
 * In certain scenarios producer data was transferred in blocking mode or data 
was saved in persistent store. If the partition was missing, we need to 
revoke/rerun the produce task to regenerate the data.
 * Introduce a new exception PartitionDataMissingException to wrap all those 
kinds of issues.


 * EnvironmentError
 * It happened due to hardware, or software issues that were related to 
specific environments. The assumption is that a task will succeed if we run it 
in a different environment, and other task run in this bad environment will 
very likely fail. If multiple task failures in the same machine due to 
EnvironmentError, we need to consider adding the bad machine to blacklist, and 
avoiding schedule task on it.
 * Introduce a new exception EnvironmentException to wrap all those kind of 
issues.


 * Recoverable
 * We assume other issues are recoverable.


> Classify Exceptions to different category for apply different failover 
> strategy
> ---
>
> Key: FLINK-10289
> URL: https://issues.apache.org/jira/browse/FLINK-10289
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>
> We need to classify exceptions and treat them with different strategies. To 
> do this, we propose to introduce the following throwable types and the 
> corresponding exceptions:
>  * NonRecoverable
>  ** We shouldn’t retry if an exception is classified as NonRecoverable.
>  ** For example, NoResourceAvailableException is a NonRecoverable exception.
>  ** Introduce a new exception UserCodeException to wrap all exceptions thrown 
> from user code.
>  * PartitionDataMissingError
>  ** In certain scenarios, producer data is transferred in blocking mode or 
> saved in a persistent store. If the partition is missing, we need to 
> revoke/rerun the producer task to regenerate the data.
>  ** Introduce a new exception PartitionDataMissingException to wrap all such 
> issues.
>  * EnvironmentError
>  ** These happen due to hardware or software issues tied to a specific 
> environment. The assumption is that a task will succeed if we run it in a 
> different environment, while other tasks running in the bad environment will 
> very likely fail. If multiple tasks fail on the same machine due to 
> EnvironmentError, we should consider adding the bad machine to a blacklist 
> and avoid scheduling tasks on it.
>  ** Introduce a new exception EnvironmentException to wrap all such issues.
>  * Recoverable
>  ** We assume all other issues are recoverable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10289) Classify Exceptions to different category for apply different failover strategy

2018-09-06 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10289:
---

 Summary: Classify Exceptions to different category for apply 
different failover strategy
 Key: FLINK-10289
 URL: https://issues.apache.org/jira/browse/FLINK-10289
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN


We need to classify exceptions and treat them with different strategies. To do 
this, we propose to introduce the following throwable types and the 
corresponding exceptions:
 * NonRecoverable
 ** We shouldn’t retry if an exception is classified as NonRecoverable.
 ** For example, NoResourceAvailableException is a NonRecoverable exception.
 ** Introduce a new exception UserCodeException to wrap all exceptions thrown 
from user code (see the sketch below).

 * PartitionDataMissingError
 ** In certain scenarios, producer data is transferred in blocking mode or 
saved in a persistent store. If the partition is missing, we need to 
revoke/rerun the producer task to regenerate the data.
 ** Introduce a new exception PartitionDataMissingException to wrap all such 
issues.

 * EnvironmentError
 ** These happen due to hardware or software issues tied to a specific 
environment. The assumption is that a task will succeed if we run it in a 
different environment, while other tasks running in the bad environment will 
very likely fail. If multiple tasks fail on the same machine due to 
EnvironmentError, we should consider adding the bad machine to a blacklist and 
avoid scheduling tasks on it.
 ** Introduce a new exception EnvironmentException to wrap all such issues.

 * Recoverable
 ** We assume all other issues are recoverable.
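
As a small illustration of the wrapping idea for NonRecoverable user errors, 
here is a minimal sketch. UserCodeException is the wrapper proposed in this 
issue; the runUserFunction helper is hypothetical, not an existing Flink method.
{code:java}
public class UserCodeWrappingSketch {

    // Proposed wrapper for exceptions thrown by user code.
    public static class UserCodeException extends RuntimeException {
        public UserCodeException(Throwable cause) {
            super("Exception thrown from user code", cause);
        }
    }

    // Hypothetical helper: runs a user function and wraps anything it throws,
    // so the failover strategy can classify the failure as NonRecoverable.
    public static void runUserFunction(Runnable userFunction) {
        try {
            userFunction.run();
        } catch (Throwable t) {
            throw new UserCodeException(t);
        }
    }

    public static void main(String[] args) {
        // Example: a user error such as divide-by-zero gets wrapped and should not be retried.
        runUserFunction(() -> System.out.println(1 / Integer.parseInt("0")));
    }
}{code}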



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10288) Failover Strategy improvement

2018-09-06 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10288:
---

 Summary: Failover Strategy improvement
 Key: FLINK-10288
 URL: https://issues.apache.org/jira/browse/FLINK-10288
 Project: Flink
  Issue Type: Improvement
  Components: JobManager
Reporter: JIN SUN
Assignee: JIN SUN


Flink puts significant effort into making streaming jobs fault tolerant. The 
checkpoint mechanism and exactly-once semantics set Flink apart from other 
systems. However, there are still some cases that are not handled very well. 
These cases apply to both streaming and batch scenarios, and they are 
orthogonal to the current fault-tolerance mechanism. Here is a summary:
 # Some failures are non-recoverable, such as a user error like a 
DivideByZeroException. We shouldn't try to restart the task, as it will never 
succeed. DivideByZeroException is just a simple case; such errors sometimes 
are not easy to reproduce or predict, as they might only be triggered by 
specific input data, so we shouldn’t retry for all user code exceptions.
 # There is no limit on task retries today: unless a SuppressRestartException 
is encountered, a task will keep retrying until it succeeds. As mentioned 
above, we shouldn’t retry at all for some cases, and for the exceptions we can 
retry, such as a network exception, should we have a retry limit? We need to 
retry on any transient issue, but we also need to set a limit to avoid 
infinite retries and wasted resources. For batch and streaming workloads, we 
might need different strategies.
 # Some exceptions are caused by hardware issues, such as disk or network 
malfunction. When a task/TaskManager fails because of these, we should detect 
it and avoid scheduling onto that machine next time.
 # If a task reads from a blocking result partition and its input is not 
available, we can ‘revoke’ the producer task: mark it as failed and rerun the 
upstream task to regenerate the data. The revocation can propagate up through 
the chain (a rough sketch follows at the end of this description). In Spark, 
revocation is naturally supported by lineage.

To make fault tolerance easier, we need to keep behavior as deterministic as 
possible. For user code this is hard to control; for system-related code we 
can fix it. For example, we should at least make sure that different attempts 
of the same task get the same inputs (we have a bug in the current codebase, 
in DataSourceTask, that cannot guarantee this). Note that this is tracked by 
[FLINK-10205].

For details, see this proposal:

[https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?usp=sharing]
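
As a rough sketch of the revocation idea in item 4 (the graph representation 
and method names here are hypothetical, not the actual ExecutionGraph API): 
when a consumer cannot find its blocking input partition, walk up the producer 
chain and rerun every producer whose result is gone.
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class RevocationSketch {

    // Maps a task to the upstream tasks whose blocking results it consumes.
    private final Map<String, List<String>> producersOf;
    // Tracks whether a producer's result partition is still available.
    private final Map<String, Boolean> resultAvailable;

    public RevocationSketch(Map<String, List<String>> producersOf,
                            Map<String, Boolean> resultAvailable) {
        this.producersOf = producersOf;
        this.resultAvailable = resultAvailable;
    }

    // Revoke (mark failed and rerun) every upstream producer whose result is missing,
    // propagating further up the chain as needed.
    public void revokeMissingProducers(String failedConsumer) {
        for (String producer : producersOf.getOrDefault(failedConsumer, Collections.emptyList())) {
            if (!resultAvailable.getOrDefault(producer, Boolean.FALSE)) {
                rerun(producer);
                revokeMissingProducers(producer);
            }
        }
    }

    private void rerun(String task) { /* schedule a new attempt of the producer task */ }
}{code}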
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-08-26 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593082#comment-16593082
 ] 

JIN SUN commented on FLINK-10205:
-

Great, Stephan. I actually had the same idea; for batch failover there are a 
bunch of things we need to consider and discuss. I can prepare a FLIP.

> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Assignee: JIN SUN
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSource task pulls InputSplits from the JobManager to achieve 
> better performance. However, when a DataSourceTask fails and is rerun, it 
> will not get the same splits as its previous attempt; this can introduce 
> inconsistent results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all assigned splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-08-24 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591489#comment-16591489
 ] 

JIN SUN commented on FLINK-10205:
-

In my opinion we need determinism: when a task is rerun, its output should be 
the same as that of the previous attempt. Today's logic might skip the splits 
the previous attempt processed and assign different splits, which will lead to 
incorrect results.

Fabian, we don't restrict the InputSplit assignment; instead, by adding a small 
piece of code in Execution.java and ExecutionVertex.java, we can make it 
deterministic.

I'm preparing the code; we can have further discussion when it is ready.

> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSource task pulls InputSplits from the JobManager to achieve 
> better performance. However, when a DataSourceTask fails and is rerun, it 
> will not get the same splits as its previous attempt; this can introduce 
> inconsistent results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all assigned splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-08-24 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591472#comment-16591472
 ] 

JIN SUN commented on FLINK-10205:
-

Thanks Vinoyang,

I cannot assign the bug to myself; I've sent an email to Aljoscha Krettek 
asking for permission.

Jin




> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSource task pulls InputSplits from the JobManager to achieve 
> better performance. However, when a DataSourceTask fails and is rerun, it 
> will not get the same splits as its previous attempt; this can introduce 
> inconsistent results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all assigned splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-08-23 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591066#comment-16591066
 ] 

JIN SUN commented on FLINK-10205:
-

I would like to work on this issue

> Batch Job: InputSplit Fault tolerant for DataSourceTask
> ---
>
> Key: FLINK-10205
> URL: https://issues.apache.org/jira/browse/FLINK-10205
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Reporter: JIN SUN
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Today the DataSource task pulls InputSplits from the JobManager to achieve 
> better performance. However, when a DataSourceTask fails and is rerun, it 
> will not get the same splits as its previous attempt; this can introduce 
> inconsistent results or even data corruption.
> Furthermore, if two executions run at the same time (in the batch scenario), 
> these two executions should process the same splits.
> We need to fix this to make the inputs of a DataSourceTask deterministic. The 
> proposal is to save all assigned splits in the ExecutionVertex and have the 
> DataSourceTask pull splits from there.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10205) Batch Job: InputSplit Fault tolerant for DataSourceTask

2018-08-23 Thread JIN SUN (JIRA)
JIN SUN created FLINK-10205:
---

 Summary: Batch Job: InputSplit Fault tolerant for DataSourceTask
 Key: FLINK-10205
 URL: https://issues.apache.org/jira/browse/FLINK-10205
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager
Reporter: JIN SUN


Today the DataSource task pulls InputSplits from the JobManager to achieve 
better performance. However, when a DataSourceTask fails and is rerun, it will 
not get the same splits as its previous attempt; this can introduce 
inconsistent results or even data corruption.

Furthermore, if two executions run at the same time (in the batch scenario), 
these two executions should process the same splits.

We need to fix this to make the inputs of a DataSourceTask deterministic. The 
proposal is to save all assigned splits in the ExecutionVertex and have the 
DataSourceTask pull splits from there.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)