[jira] [Commented] (FLINK-30403) The reported latest completed checkpoint is discarded

2023-02-23 Thread Zdenek Tison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17692567#comment-17692567
 ] 

Zdenek Tison commented on FLINK-30403:
--

Hi, thanks for asking. No, let's close it. 

> The reported latest completed checkpoint is discarded
> -
>
> Key: FLINK-30403
> URL: https://issues.apache.org/jira/browse/FLINK-30403
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.16.0
>Reporter: Zdenek Tison
>Priority: Major
>
> There is a small window in which the reported latest completed checkpoint can
> be marked as discarded while the new checkpoint has not been reported yet.
> The reason is that the function
> _addCompletedCheckpointToStoreAndSubsumeOldest_ is called before
> _reportCompletedCheckpoint_ in _CheckpointCoordinator_.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30403) The reported latest completed checkpoint is discarded

2022-12-13 Thread Zdenek Tison (Jira)
Zdenek Tison created FLINK-30403:


 Summary: The reported latest completed checkpoint is discarded
 Key: FLINK-30403
 URL: https://issues.apache.org/jira/browse/FLINK-30403
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.16.0
Reporter: Zdenek Tison


There is a small window in which the reported latest completed checkpoint can
be marked as discarded while the new checkpoint has not been reported yet.

The reason is that the function _addCompletedCheckpointToStoreAndSubsumeOldest_
is called before _reportCompletedCheckpoint_ in _CheckpointCoordinator_.
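
The window described above can be illustrated with a heavily simplified, single-threaded model. This is only a sketch: the class and its methods are hypothetical stand-ins, not Flink's actual CheckpointCoordinator, and the store is assumed to retain a single checkpoint.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the ordering problem; not Flink code. */
final class CheckpointWindowSketch {
    // Assumption: the store retains only one completed checkpoint.
    private final Deque<Long> retained = new ArrayDeque<>();
    private final Set<Long> discarded = new HashSet<>();
    // What a consumer of the reported statistics would see as "latest completed".
    private long latestReported = -1L;

    /** Step 1: add the new checkpoint to the store and discard the subsumed ones. */
    void addToStoreAndSubsumeOldest(long checkpointId) {
        retained.addLast(checkpointId);
        while (retained.size() > 1) {
            discarded.add(retained.removeFirst());
        }
    }

    /** Step 2: report the new checkpoint as the latest completed one. */
    void reportCompleted(long checkpointId) {
        latestReported = checkpointId;
    }

    /** True while the checkpoint reported as latest has already been discarded. */
    boolean reportedLatestIsDiscarded() {
        return discarded.contains(latestReported);
    }

    /** Completes checkpoint 1, then checkpoint 2, and samples the state between the two steps. */
    static boolean[] observeWindow() {
        CheckpointWindowSketch c = new CheckpointWindowSketch();
        c.addToStoreAndSubsumeOldest(1L);
        c.reportCompleted(1L);

        c.addToStoreAndSubsumeOldest(2L);                    // checkpoint 1 is discarded here
        boolean duringWindow = c.reportedLatestIsDiscarded(); // the reported latest is still 1
        c.reportCompleted(2L);
        boolean afterReport = c.reportedLatestIsDiscarded();
        return new boolean[] {duringWindow, afterReport};
    }
}
```

In this model, anything that reads the reported latest checkpoint between the two steps sees a checkpoint that is already discarded. The window closes once the new checkpoint is reported.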

 

 





[jira] [Commented] (FLINK-14114) Shift down ClusterClient#timeout to RestClusterClient

2019-09-25 Thread tison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937893#comment-16937893
 ] 

tison commented on FLINK-14114:
---

After further investigation, I think this timeout, whose value is the same as
the Akka timeout, is a default chosen without good reason. As part of the
FLIP-74 effort we may replace all synchronous methods of ClusterClient with
their asynchronous versions. If so, this timeout can be removed, and users of
these methods should decide on the exact timeout at the call site.
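
A minimal sketch of what "the caller decides the timeout" could look like with an asynchronous method. The method name `getJobStatusAsync` and the returned strings are hypothetical, chosen only for illustration. They are not the actual ClusterClient API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration; not the actual ClusterClient interface. */
final class CallerSideTimeoutSketch {
    /** An asynchronous client method with no timeout baked in. */
    static CompletableFuture<String> getJobStatusAsync(boolean clusterResponds) {
        // A future that is never completed models a cluster that does not answer.
        return clusterResponds
                ? CompletableFuture.completedFuture("RUNNING")
                : new CompletableFuture<>();
    }

    /** The caller chooses how long it is willing to wait. */
    static String statusOrTimeout(boolean clusterResponds, long timeoutMillis) {
        try {
            return getJobStatusAsync(clusterResponds).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMED_OUT";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

With this shape the client itself carries no timeout field. Each call site picks a value that fits its own context.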

> Shift down ClusterClient#timeout to RestClusterClient
> -
>
> Key: FLINK-14114
> URL: https://issues.apache.org/jira/browse/FLINK-14114
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: tison
>Assignee: Zhu Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{ClusterClient#timeout}} is only used in {{RestClusterClient}}. Even without
> this prerequisite, we can always shift the {{timeout}} field down to
> subclasses of {{ClusterClient}}. This is a step towards an interface-ized
> {{ClusterClient}}. As a side effect, we could reduce the dependency on
> parsing durations with Scala Duration on the fly.
> CC [~till.rohrmann] [~zhuzh]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-14114) Shift down ClusterClient#timeout to RestClusterClient

2019-09-25 Thread tison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937893#comment-16937893
 ] 

tison edited comment on FLINK-14114 at 9/25/19 4:24 PM:


After further investigation, I think this timeout, whose value is the same as
the Akka timeout, is a default chosen without good reason. As part of the
FLIP-74 effort we may replace all synchronous methods of ClusterClient with
their asynchronous versions. If so, this timeout can be removed, and users of
these methods should decide on the exact timeout at the call site.


was (Author: tison):
With another investigation I think this timeout, whose value is the same as 
akka timeout, is a default value without good reason. During the effort of 
FLIP-74 we possibly replace all synchronous  methods of ClusterClient with 
their asynchronous version. If so, this timeout can be removed, where users of 
these method should make the decision about the exact timeout at place.



[jira] [Resolved] (FLINK-14114) Shift down ClusterClient#timeout to RestClusterClient

2019-09-25 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison resolved FLINK-14114.
---
Resolution: Fixed

master via ca56b63c37176dabf163b745d6ee015012531243



[jira] [Reopened] (FLINK-14051) Deploy job cluster in attached mode

2019-09-25 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reopened FLINK-14051:
---

Reopening, as the issue is valid, although we will only revisit it once the
prerequisite is ready.

> Deploy job cluster in attached mode
> ---
>
> Key: FLINK-14051
> URL: https://issues.apache.org/jira/browse/FLINK-14051
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission, Command Line Client
>Affects Versions: 1.10.0
>Reporter: tison
>Priority: Major
>
> While working on FLINK-14048 I revisited the problem that we handle deploy
> logic in complicated if-else branches in {{CliFrontend#runProgram}}.
> Previously we said that even in attached per-job mode we deploy a session
> cluster, for historical reasons.
> However, I noticed that {{#deployJobCluster}} has a parameter
> {{boolean detached}}, and that it is also used in the sql-client package. So
> it looks like we could deploy a job cluster in attached mode, as we do in
> the sql-client package.
> However, as [~xccui] answered on the mailing list
> [here|https://lists.apache.org/x/thread.html/5464459db08f2a756af0c61eb02d34a26f04c27c62140886cad52731@%3Cuser.flink.apache.org%3E],
>  we support only a standalone session cluster for sql-client, so maybe that
> is not our case. In any case, if we cannot deploy a job cluster in attached
> mode, I'd like to know the concrete reason.
> CC [~till.rohrmann] [~twalthr]





[jira] [Updated] (FLINK-14051) Deploy job cluster in attached mode

2019-09-25 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-14051:
--
Affects Version/s: (was: 1.10.0)

> Deploy job cluster in attached mode
> ---
>
> Key: FLINK-14051
> URL: https://issues.apache.org/jira/browse/FLINK-14051
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission, Command Line Client
>Reporter: tison
>Priority: Major
>


[jira] [Assigned] (FLINK-14183) Remove unnecessary scala Duration usages in flink-runtime

2019-09-24 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-14183:
-

Assignee: Zhu Zhu

> Remove unnecessary scala Duration usages in flink-runtime
> -
>
> Key: FLINK-14183
> URL: https://issues.apache.org/jira/browse/FLINK-14183
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>
> This ticket is to remove all usages of scala {{Duration/FiniteDuration}} in 
> {{flink-runtime}}, except for those usages for {{Akka}} components (in 
> AkkaUtils, AkkaRpcActor and ActorSystemScheduledExecutorAdapter).





[jira] [Commented] (FLINK-14183) Remove unnecessary scala Duration usages in flink-runtime

2019-09-24 Thread tison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936559#comment-16936559
 ] 

tison commented on FLINK-14183:
---

Of course!

> Remove unnecessary scala Duration usages in flink-runtime
> -
>
> Key: FLINK-14183
> URL: https://issues.apache.org/jira/browse/FLINK-14183
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>


[jira] [Commented] (FLINK-14168) Remove unused BootstrapTools#generateTaskManagerConfiguration

2019-09-24 Thread tison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936545#comment-16936545
 ] 

tison commented on FLINK-14168:
---

Yes, it seems that method has never been used. Good to remove, in my opinion.

It is a bit weird that even in the
[commit|https://github.com/apache/flink/commit/92ff2b152cac3ad6a53373c0c022579306051133]
 where it was checked in, I cannot find any usage. Thus I'd like to involve its
author [~mxm] here to see whether anything was missed.

> Remove unused BootstrapTools#generateTaskManagerConfiguration
> -
>
> Key: FLINK-14168
> URL: https://issues.apache.org/jira/browse/FLINK-14168
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>
> {{BootstrapTools#generateTaskManagerConfiguration}} is not used anymore while 
> it adds {{scala.concurrent.duration.FiniteDuration}} dependency to 
> {{BootstrapTools}}.
> I think we can remove it.





[jira] [Assigned] (FLINK-14168) Remove unused BootstrapTools#generateTaskManagerConfiguration

2019-09-24 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-14168:
-

Assignee: Zhu Zhu



[jira] [Assigned] (FLINK-14182) Make TimeUtils able to parse duration string with plural form labels

2019-09-24 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-14182:
-

Assignee: Zhu Zhu

> Make TimeUtils able to parse duration string with plural form labels
> 
>
> Key: FLINK-14182
> URL: https://issues.apache.org/jira/browse/FLINK-14182
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>
> Scala Duration supports parsing the plural form of every time unit label
> except the shortest label of each TimeUnit. Namely:
> {
> "d day days",
> "h hour hours",
> "min minute minutes",
> "s sec secs second seconds",
> "ms milli millis millisecond milliseconds",
> "µs micro micros microsecond microseconds",
> "ns nano nanos nanosecond nanoseconds"
> }
> TimeUtils should support them as well.
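
The labelling rule quoted above (every label except the shortest one also has a plural form) can be sketched as follows. This is a hypothetical stand-alone parser for illustration only, not Flink's TimeUtils. The class name, the label table construction, and the error messages are all assumptions.

```java
import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of plural-aware duration parsing; not Flink's TimeUtils. */
final class PluralDurationParserSketch {
    private static final Map<String, ChronoUnit> LABELS = new HashMap<>();

    /** Registers the shortest label as-is and every other label in singular and plural form. */
    private static void register(ChronoUnit unit, String shortest, String... words) {
        LABELS.put(shortest, unit);
        for (String word : words) {
            LABELS.put(word, unit);
            LABELS.put(word + "s", unit);
        }
    }

    static {
        register(ChronoUnit.DAYS, "d", "day");
        register(ChronoUnit.HOURS, "h", "hour");
        register(ChronoUnit.MINUTES, "min", "minute");
        register(ChronoUnit.SECONDS, "s", "sec", "second");
        register(ChronoUnit.MILLIS, "ms", "milli", "millisecond");
        register(ChronoUnit.MICROS, "\u00b5s", "micro", "microsecond"); // micro sign + "s"
        register(ChronoUnit.NANOS, "ns", "nano", "nanosecond");
    }

    /** Parses strings such as "10 seconds" or "5ms" into a java.time.Duration. */
    static Duration parse(String text) {
        String trimmed = text.trim();
        int i = 0;
        while (i < trimmed.length() && Character.isDigit(trimmed.charAt(i))) {
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("No number found in: " + text);
        }
        long value = Long.parseLong(trimmed.substring(0, i));
        String label = trimmed.substring(i).trim().toLowerCase(Locale.US);
        ChronoUnit unit = LABELS.get(label);
        if (unit == null) {
            throw new IllegalArgumentException("Unknown time unit label: " + label);
        }
        return Duration.of(value, unit);
    }
}
```

For example, `parse("10 seconds")` and `parse("10 s")` yield the same duration, while a pluralized shortest label such as `"3 hs"` is rejected.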





[jira] [Commented] (FLINK-14182) Make TimeUtils able to parse duration string with plural form labels

2019-09-24 Thread tison (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936533#comment-16936533
 ] 

tison commented on FLINK-14182:
---

I've assigned the ticket to you [~zhuzh]. Go ahead :-)



[jira] [Assigned] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-14177:
-

Assignee: lamber-ken

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>
> According to FLINK-10052 and FLINK-14177, we need to upgrade the version of
> CuratorFramework first.
> Curator 4.2.0 supports
> 1) ZooKeeper 3.4.* and 3.5.*
> 2) connectionStateErrorPolicy





[jira] [Resolved] (FLINK-14113) Remove class JobWithJars

2019-09-23 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison resolved FLINK-14113.
---
Resolution: Fixed

master via f373628a0e68b79ff2437ea8c8114ed8d3114091

> Remove class JobWithJars
> 
>
> Key: FLINK-14113
> URL: https://issues.apache.org/jira/browse/FLINK-14113
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: tison
>Assignee: tison
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{JobWithJars}} is a batch-only concept that acts as a POJO consisting of a
> {{Plan}} and the {{URL}}s of libs. We can
> 1. inline the usage of {{Plan}} and {{URL}}s as we do in the streaming case.
> 2. extract static methods into a utility class, say {{ClientUtils}}.
> The main purpose here is to move away from batch-specific concepts that do
> not bring much benefit.





[jira] [Closed] (FLINK-14096) Merge NewClusterClient into ClusterClient

2019-09-20 Thread tison (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison closed FLINK-14096.
-
Resolution: Resolved

master via 1c20b8397299b17279e30ed4dca1f9efe6b8d9ec

> Merge NewClusterClient into ClusterClient
> -
>
> Key: FLINK-14096
> URL: https://issues.apache.org/jira/browse/FLINK-14096
> Project: Flink
>  Issue Type: Sub-task
>  Components: Client / Job Submission
>Reporter: tison
>Assignee: tison
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> With the effort under FLINK-10392 we don't need the bridge class 
> {{NewClusterClient}} any more. We can just merge {{NewClusterClient}} into 
> {{ClusterClient}} towards an interface-ized {{ClusterClient}}.





[jira] [Created] (FLINK-14149) Introduce ZooKeeperLeaderElectionServiceNG

2019-09-20 Thread tison (Jira)
tison created FLINK-14149:
-

 Summary: Introduce ZooKeeperLeaderElectionServiceNG
 Key: FLINK-14149
 URL: https://issues.apache.org/jira/browse/FLINK-14149
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: tison
Assignee: tison


Subsequent to the discussion in FLINK-10333, we reached a consensus to refactor
ZK-based storage with a transaction store mechanism. The overall design can be
found in the design document linked below.

This subtask aims at introducing the prerequisite for adopting the transaction
store, i.e., a new leader election service for the ZK scenario. It is necessary
because we have to retrieve the corresponding latch path per contender,
following the algorithm described in FLINK-10333.

Here are the (descriptive) details of the implementation.

We adopt the optimized version of [this
recipe|https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection][1].
 Code details can be found in [this
branch|https://github.com/TisonKun/flink/tree/election-service] and the state
machine can be found in the design document attached. Here are only the two
most important differences from the former implementation:

(1) *Leader election is a one-shot service.*

Specifically, we create only one latch for a specific contender. We tolerate
{{SUSPENDED}}, a.k.a. {{CONNECTIONLOSS}}, so the only situation in which we
lose leadership is session expiration, which implies that the ephemeral latch
znode is deleted. We don't re-participate as a contender, so after
{{revokeLeadership}} a contender will never be granted leadership again. This
is not a problem, but we can refactor the contender side further for better
behavior.

(2) *Leader info znode is {{PERSISTENT}}.*

This is because we now regard create/setData on the leader info znode as a
leader-only operation and thus do it in a transaction. If we kept using an
ephemeral znode, it would be hard to test: because we share the ZK client, the
ephemeral znode is not deleted, so we would have to deal with complex znode
stat that a transaction cannot simply handle. And since the znode is
{{PERSISTENT}}, we introduce a {{concealLeaderInfo}} method, called back on
contender stop, to clean it up.
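
The two differences above can be summarized in a small in-memory model. It is a hypothetical sketch only: it involves no ZooKeeper or Curator calls, and all names except {{concealLeaderInfo}} (represented here by `stop()`) are invented for illustration. The leader-only transaction is modeled by a write that succeeds only while leadership is held.

```java
/** Hypothetical in-memory model of the one-shot election; not the code in the linked branch. */
final class OneShotElectionSketch {
    /** Simplified connection events, named after the states mentioned in the description. */
    enum Event { SUSPENDED, RECONNECTED, SESSION_EXPIRED }

    private boolean leader;
    private boolean revoked;    // once set, leadership is never granted again
    private String leaderInfo;  // models the PERSISTENT leader info znode

    /** Grants leadership unless this contender has already been revoked (one-shot). */
    void grantLeadership() {
        if (!revoked) {
            leader = true;
        }
    }

    /** SUSPENDED and RECONNECTED are tolerated; only session expiration revokes leadership. */
    void onEvent(Event event) {
        if (event == Event.SESSION_EXPIRED) {
            leader = false;
            revoked = true;
        }
    }

    /** Models the leader-only transaction: the write succeeds only while leadership is held. */
    boolean writeLeaderInfo(String info) {
        if (!leader) {
            return false;
        }
        leaderInfo = info;
        return true;
    }

    /** Models concealLeaderInfo on contender stop: the persistent node is cleaned up explicitly. */
    void stop() {
        leaderInfo = null;
    }

    boolean isLeader() {
        return leader;
    }

    String readLeaderInfo() {
        return leaderInfo;
    }
}
```

In this model the leader info survives session expiration, as a persistent node would, and disappears only when the contender stops.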

Another topic is the interface. Going back to the big picture of FLINK-10333,
we eventually use a transaction store for persisting the job graph,
checkpoints, and so on. So there will be a {{getLeaderStore}} method added to
{{LeaderElectionServices}}. Because we don't use it at all yet, it is an open
question whether we add the method to the interface in this subtask and, if
so, whether we implement it for the other election service implementations.

{{concealLeaderInfo}} is another method that appears in the document; it aims
at cleaning up the leader info node on stop. The same question applies to it
as to {{getLeaderStore}}.

*What we gain*

1. Basics for the overall goal under FLINK-10333
2. The leader info node must be modified by the current leader. Thus we can
remove a lot of concurrency handling logic in the current ZLES, including the
use of {{NodeCache}} as well as dealing with the complex stat of the ephemeral
leader info node.

[1] For other implementations, I started [a
thread|https://lists.apache.org/x/thread.html/594b66ecb1d60b560a5c4c08ed1b2a67bc29143cb4e8d368da8c39b2@%3Cuser.zookeeper.apache.org%3E]
 in ZK and Curator to discuss them. In any case, these are implementation
details only, and interfaces and semantics should not be affected.





[jira] [Assigned] (FLINK-10508) Port JobManagerITCase to new code base

2018-10-07 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-10508:
-

Assignee: tison

> Port JobManagerITCase to new code base
> --
>
> Key: FLINK-10508
> URL: https://issues.apache.org/jira/browse/FLINK-10508
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerITCase}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10508) Port JobManagerITCase to new code base

2018-10-07 Thread tison (JIRA)
tison created FLINK-10508:
-

 Summary: Port JobManagerITCase to new code base
 Key: FLINK-10508
 URL: https://issues.apache.org/jira/browse/FLINK-10508
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.7.0
Reporter: tison
 Fix For: 1.7.0


Port {{JobManagerITCase}} to new code base.





[jira] [Commented] (FLINK-10454) Travis fails on ScheduleOrUpdateConsumersTest

2018-10-05 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639508#comment-16639508
 ] 

tison commented on FLINK-10454:
---

https://travis-ci.org/apache/flink/jobs/434894763
https://api.travis-ci.org/v3/job/434894763/log.txt

https://travis-ci.org/apache/flink/jobs/434894758
https://api.travis-ci.org/v3/job/434894758/log.txt

These are from a pull request branch, but since the failure can be reproduced
locally and the PR has nothing to do with it, I am filing this issue.

> Travis fails on ScheduleOrUpdateConsumersTest
> -
>
> Key: FLINK-10454
> URL: https://issues.apache.org/jira/browse/FLINK-10454
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0
>
>
> Can even be reproduced locally. Maybe a duplicate but as a reminder.
> {code:java}
> org.apache.flink.runtime.jobmanager.scheduler.ScheduleOrUpdateConsumersTest  Time elapsed: 4.514 sec  <<< ERROR!
> java.net.BindException: Address already in use
>     at sun.nio.ch.Net.bind0(Native Method)
>     at sun.nio.ch.Net.bind(Net.java:433)
>     at sun.nio.ch.Net.bind(Net.java:425)
>     at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1358)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:1019)
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
>     at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:366)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>     at java.lang.Thread.run(Thread.java:748)
> {code}





[jira] [Comment Edited] (FLINK-10454) Travis fails on ScheduleOrUpdateConsumersTest

2018-10-05 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639508#comment-16639508
 ] 

tison edited comment on FLINK-10454 at 10/5/18 9:07 AM:


https://travis-ci.org/apache/flink/jobs/434894763
https://api.travis-ci.org/v3/job/434894763/log.txt

https://travis-ci.org/apache/flink/jobs/434894758
https://api.travis-ci.org/v3/job/434894758/log.txt

These are from a pull request branch, but since the failure can be reproduced
locally (on the master branch) and the PR has nothing to do with it, I am
filing this issue.


was (Author: tison):
https://travis-ci.org/apache/flink/jobs/434894763
https://api.travis-ci.org/v3/job/434894763/log.txt

https://travis-ci.org/apache/flink/jobs/434894758
https://api.travis-ci.org/v3/job/434894758/log.txt

These are on a pull request branch, but since it can be reproduced locally and 
the pr has nothing to do with it I file this issue.



[jira] [Commented] (FLINK-10392) Remove legacy mode

2018-10-04 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638600#comment-16638600
 ] 

tison commented on FLINK-10392:
---

[~isunjin] Previously I made a similar proposal, but the removal of legacy mode
is now a goal of release-1.7.0, so a temporary move (to a "legacy" folder or
something similar) is not needed. I'd appreciate it if you have time to
participate in this thread :-)

See also FLINK-10302.

> Remove legacy mode
> --
>
> Key: FLINK-10392
> URL: https://issues.apache.org/jira/browse/FLINK-10392
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This issue is the umbrella issue to remove the legacy mode code from Flink.





[jira] [Commented] (FLINK-10456) Remove org.apache.flink.api.common.time.Deadline

2018-10-04 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637891#comment-16637891
 ] 

tison commented on FLINK-10456:
---

Thanks for the explanation!

> Remove org.apache.flink.api.common.time.Deadline
> 
>
> Key: FLINK-10456
> URL: https://issues.apache.org/jira/browse/FLINK-10456
> Project: Flink
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
>
> We already have {{scala.concurrent.duration.Deadline}}.
> {{org.apache.flink.api.common.time.Deadline}} is not a richer extension of
> it. I wonder in which situations we need a customized Deadline. If there are
> none, introducing a weaker alternative seems unreasonable and causes
> confusion.
> What do you think? cc [~StephanEwen] [~Zentol]





[jira] [Updated] (FLINK-8033) Build Flink with JDK 9

2018-10-01 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-8033:
-
Description: colored textThis is a JIRA to track all issues that found to 
make Flink compatible with Java 9.  (was: This is a JIRA to track all issues 
that found to make Flink compatible with Java 9.)

> Build Flink with JDK 9
> --
>
> Key: FLINK-8033
> URL: https://issues.apache.org/jira/browse/FLINK-8033
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System
>Affects Versions: 1.4.0
>Reporter: Hai Zhou
>Assignee: Chesnay Schepler
>Priority: Major
> Fix For: 1.7.0
>
>
> colored textThis is a JIRA to track all issues that found to make Flink 
> compatible with Java 9.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10469) FileChannel may not write the whole buffer in a single call to FileChannel.write(Buffer buffer)

2018-10-01 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633639#comment-16633639
 ] 

tison commented on FLINK-10469:
---

Sounds reasonable. Maybe [~zjwang] and [~NicoK] could provide more professional 
advice.

> FileChannel may not write the whole buffer in a single call to 
> FileChannel.write(Buffer buffer)
> ---
>
> Key: FLINK-10469
> URL: https://issues.apache.org/jira/browse/FLINK-10469
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.4.1, 1.4.2, 1.5.3, 1.6.0, 1.6.1, 1.7.0, 1.5.4, 1.6.2
>Reporter: Yun Gao
>Priority: Major
>
> Currently all the calls to _FileChannel.write(ByteBuffer src)_ assume that 
> this method will not return before the whole buffer is written, like the one 
> in _AsynchronousFileIOChannel.write()._
>  
> However, this assumption may not be right for all the environments. We have 
> encountered the case that only part of a buffer was written on a cluster with 
> a high IO load, and the target file got messy. 
>  
> To fix this issue, I think we should add a utility method to 
> org.apache.flink.util.IOUtils that writes the whole buffer in a loop, and 
> replace all the calls to _FileChannel.write(ByteBuffer)_ with this new 
> method. 
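
A minimal sketch of the kind of utility method proposed in the description above. The class name ChannelWrites and the method name writeCompletely are assumptions made for illustration; they are not taken from Flink's IOUtils. The loop relies only on the documented behavior that FileChannel.write(ByteBuffer) returns the number of bytes written and advances the buffer position.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class; the names are illustrative and not Flink API.
final class ChannelWrites {

    /** Calls write() repeatedly until the buffer has no remaining bytes. */
    static void writeCompletely(FileChannel channel, ByteBuffer src) throws IOException {
        while (src.hasRemaining()) {
            // A single call may write fewer bytes than src.remaining().
            channel.write(src);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("write-completely", ".bin");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            writeCompletely(channel, ByteBuffer.wrap(new byte[64 * 1024]));
        }
        System.out.println("bytes on disk: " + Files.size(file));
        Files.delete(file);
    }
}
```

Call sites that currently invoke write() once would call such a helper instead, so a partial write can no longer silently truncate the data.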



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10468) Potential missing break for PARTITION_CUSTOM in OutputEmitter ctor

2018-09-30 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633413#comment-16633413
 ] 

tison commented on FLINK-10468:
---

I just took a look at that code and think the fall-through is intended: 
{{PARTITION_CUSTOM}} means partitioning with a custom partitioner, so it sets 
{{extractedKeys}}, but all of the strategies need {{channels}}.

However, I am not very familiar with this code, and I think [~zjwang], 
[~piwaniuk] and [~NicoK] would know more about it. If they see it the same way 
I do, we can close this issue as Won't Fix.

> Potential missing break for PARTITION_CUSTOM in OutputEmitter ctor
> --
>
> Key: FLINK-10468
> URL: https://issues.apache.org/jira/browse/FLINK-10468
> Project: Flink
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
>
> Here is related code:
> {code}
> switch (strategy) {
> case PARTITION_CUSTOM:
>   extractedKeys = new Object[1];
> case FORWARD:
> {code}
> It seems a 'break' is missing prior to FORWARD case.
> {code}
> if (strategy == ShipStrategyType.PARTITION_CUSTOM && partitioner == null) 
> {
>   throw new NullPointerException("Partitioner must not be null when the 
> ship strategy is set to custom partitioning.");
> }
> {code}
> Since the above check is for PARTITION_CUSTOM, it seems we can place the 
> check in the switch statement.
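
To illustrate the point about the fall-through, here is a simplified stand-in for the constructor logic quoted above. Only the PARTITION_CUSTOM and FORWARD cases and the extractedKeys assignment come from the snippet; the class name, the PARTITION_HASH case, the field types and the array sizes are assumptions made for this sketch, not the real OutputEmitter code.

```java
// Simplified, hypothetical model of a constructor that relies on switch fall-through.
final class FallThroughDemo {
    enum Strategy { PARTITION_CUSTOM, FORWARD, PARTITION_HASH }

    Object[] extractedKeys; // only needed for custom partitioning
    int[] channels;         // needed by every strategy

    FallThroughDemo(Strategy strategy) {
        switch (strategy) {
            case PARTITION_CUSTOM:
                extractedKeys = new Object[1];
                // Intentional fall-through: custom partitioning also needs channels.
            case FORWARD:
            case PARTITION_HASH:
                channels = new int[1];
                break;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        FallThroughDemo custom = new FallThroughDemo(Strategy.PARTITION_CUSTOM);
        FallThroughDemo forward = new FallThroughDemo(Strategy.FORWARD);
        System.out.println("custom has keys and channels: "
                + (custom.extractedKeys != null && custom.channels != null));
        System.out.println("forward has channels only: "
                + (forward.extractedKeys == null && forward.channels != null));
    }
}
```

If a break were added after the PARTITION_CUSTOM case in this model, channels would stay null for custom partitioning. An explicit fall-through comment makes the intent visible to readers and static analysis tools.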



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10468) Potential missing break for PARTITION_CUSTOM in OutputEmitter ctor

2018-09-30 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-10468:
-

Assignee: (was: tison)

> Potential missing break for PARTITION_CUSTOM in OutputEmitter ctor
> --
>
> Key: FLINK-10468
> URL: https://issues.apache.org/jira/browse/FLINK-10468
> Project: Flink
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
>
> Here is related code:
> {code}
> switch (strategy) {
> case PARTITION_CUSTOM:
>   extractedKeys = new Object[1];
> case FORWARD:
> {code}
> It seems a 'break' is missing prior to FORWARD case.
> {code}
> if (strategy == ShipStrategyType.PARTITION_CUSTOM && partitioner == null) 
> {
>   throw new NullPointerException("Partitioner must not be null when the 
> ship strategy is set to custom partitioning.");
> }
> {code}
> Since the above check is for PARTITION_CUSTOM, it seems we can place the 
> check in the switch statement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10468) Potential missing break for PARTITION_CUSTOM in OutputEmitter ctor

2018-09-30 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-10468:
-

Assignee: tison

> Potential missing break for PARTITION_CUSTOM in OutputEmitter ctor
> --
>
> Key: FLINK-10468
> URL: https://issues.apache.org/jira/browse/FLINK-10468
> Project: Flink
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: tison
>Priority: Minor
>
> Here is related code:
> {code}
> switch (strategy) {
> case PARTITION_CUSTOM:
>   extractedKeys = new Object[1];
> case FORWARD:
> {code}
> It seems a 'break' is missing prior to FORWARD case.
> {code}
> if (strategy == ShipStrategyType.PARTITION_CUSTOM && partitioner == null) 
> {
>   throw new NullPointerException("Partitioner must not be null when the 
> ship strategy is set to custom partitioning.");
> }
> {code}
> Since the above check is for PARTITION_CUSTOM, it seems we can place the 
> check in the switch statement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10466) flink-yarn-tests should depend flink-dist

2018-09-29 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633102#comment-16633102
 ] 

tison commented on FLINK-10466:
---

[~Zentol] I had seen issues where the reporter differs from the assignee, and 
have now learned that we do not assign issues to other people. Sorry for the 
burden.

And yes, in theory it would be an issue, so I wonder why the current Travis 
build works well with it. flink-yarn-tests depends on flink-dist, but this is 
not explicitly declared in pom.xml. Could we add it as a safeguard?

> flink-yarn-tests should depend flink-dist
> -
>
> Key: FLINK-10466
> URL: https://issues.apache.org/jira/browse/FLINK-10466
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Maybe add
> {code:java}
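> <dependency>
>  <groupId>org.apache.flink</groupId>
>  <artifactId>flink-dist_${scala.binary.version}</artifactId>
>  <version>${project.version}</version>
>  <scope>test</scope>
>  <type>pom</type>
> </dependency>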
> {code}
> I am not entirely sure, but the missing dependency causes failures in my 
> automated testing process, and adding it makes the error disappear. I even 
> wonder how it currently works on Travis.
> flink-yarn-tests obviously depends on flink-dist, since some tests try to 
> find the Flink uber jar.
> Please take a look at this. cc [~Zentol]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10466) flink-yarn-tests should depend flink-dist

2018-09-29 Thread tison (JIRA)
tison created FLINK-10466:
-

 Summary: flink-yarn-tests should depend flink-dist
 Key: FLINK-10466
 URL: https://issues.apache.org/jira/browse/FLINK-10466
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.7.0
Reporter: tison
Assignee: Chesnay Schepler
 Fix For: 1.7.0


Maybe add
{code:java}

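<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flink-dist_${scala.binary.version}</artifactId>
 <version>${project.version}</version>
 <scope>test</scope>
 <type>pom</type>
</dependency>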
{code}
I am not entirely sure, but the missing dependency causes failures in my 
automated testing process, and adding it makes the error disappear. I even 
wonder how it currently works on Travis.

flink-yarn-tests obviously depends on flink-dist, since some tests try to find 
the Flink uber jar.

Please take a look at this. cc [~Zentol]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10405) Port JobManagerFailsITCase to new code base

2018-09-29 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632975#comment-16632975
 ] 

tison commented on FLINK-10405:
---

In FLIP-6, the TM tries to reconnect to the JM; see 
https://github.com/apache/flink/commit/63d4819e197b1df1651157fd8f86c8ca0540d0b1

> Port JobManagerFailsITCase to new code base
> ---
>
> Key: FLINK-10405
> URL: https://issues.apache.org/jira/browse/FLINK-10405
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerFailsITCase}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10456) Remove org.apache.flink.api.common.time.Deadline

2018-09-27 Thread tison (JIRA)
tison created FLINK-10456:
-

 Summary: Remove org.apache.flink.api.common.time.Deadline
 Key: FLINK-10456
 URL: https://issues.apache.org/jira/browse/FLINK-10456
 Project: Flink
  Issue Type: Improvement
  Components: Core
Affects Versions: 1.7.0
Reporter: tison
Assignee: tison
 Fix For: 1.7.0


We already have {{scala.concurrent.duration.Deadline}}.

{{org.apache.flink.api.common.time.Deadline}} is not a richer extension of it. 
I doubt there is a situation in which we need a customized Deadline. If there 
is none, introducing a weaker alternative seems unreasonable and causes 
confusion.

What do you think? cc [~StephanEwen] [~Zentol]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10454) Travis fails on ScheduleOrUpdateConsumersTest

2018-09-27 Thread tison (JIRA)
tison created FLINK-10454:
-

 Summary: Travis fails on ScheduleOrUpdateConsumersTest
 Key: FLINK-10454
 URL: https://issues.apache.org/jira/browse/FLINK-10454
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.7.0
Reporter: tison
 Fix For: 1.7.0


This can even be reproduced locally. It may be a duplicate, but I am filing it 
as a reminder.

{code:java}
org.apache.flink.runtime.jobmanager.scheduler.ScheduleOrUpdateConsumersTest Time elapsed: 4.514 sec <<< ERROR!
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1358)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:1019)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
    at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:366)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)
{code}
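
For readers unfamiliar with this error, the sketch below shows how a BindException of this kind typically arises: a second server socket is bound to a port that another socket is already listening on. The class and method names are hypothetical and nothing here is taken from ScheduleOrUpdateConsumersTest. Letting the operating system pick the port by passing 0 is one common way for tests to avoid such collisions; whether that applies to this test is an assumption, not something stated in the issue.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

// Hypothetical demo class; it only illustrates the failure mode in the trace above.
final class PortBindingDemo {

    /** Returns true if binding a second socket to an already bound port fails. */
    static boolean secondBindFails() throws IOException {
        // Port 0 asks the operating system for any free port.
        try (ServerSocket first = new ServerSocket(0)) {
            try (ServerSocket second = new ServerSocket(first.getLocalPort())) {
                return false; // the second bind unexpectedly succeeded
            } catch (BindException e) {
                return true;  // typically "Address already in use"
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second bind fails: " + secondBindFails());
    }
}
```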



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10450) Broken links in the documentation

2018-09-27 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630136#comment-16630136
 ] 

tison commented on FLINK-10450:
---

For the {{/flinkDev/building.html}} part, it might be relevant to [this 
commit|https://github.com/apache/flink/commit/52cbe07ba7a367880475af59596adc2365bd8a21].
 Maybe [~fhueske] could be involved.

> Broken links in the documentation
> -
>
> Key: FLINK-10450
> URL: https://issues.apache.org/jira/browse/FLINK-10450
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation, Project Website
>Affects Versions: 1.7.0
>Reporter: Chesnay Schepler
>Priority: Major
> Fix For: 1.7.0
>
>
> {code}
> [2018-09-27 09:57:51] ERROR `/flinkdev/building.html' not found.
> [2018-09-27 09:57:51] ERROR `/dev/stream/dataset_transformations.html' not 
> found.
> [2018-09-27 09:57:51] ERROR `/dev/stream/windows.html' not found.
> ---
> Found 3 broken links.
> Search for page containing broken link using 'grep -R BROKEN_PATH DOCS_DIR'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10429) Redesign Flink Scheduling, introducing dedicated Scheduler component

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629769#comment-16629769
 ] 

tison edited comment on FLINK-10429 at 9/27/18 5:18 AM:


I like the proposal to treat scheduling as a separate component. It would help 
with further scheduling optimizations such as FLINK-10240.

To  [~zhuzh], [~tiemsn]
I think the first step of this redesign would be the extraction part; maybe 
you can take this document into consideration? Also, the link to FLINK-10240 
is broken; the correct one is 
https://docs.google.com/document/d/1zAseuBnqNXg3pst3vLBTc8yGOUo485J2LVWBAdFCW9I/
 (without the "edit" part). I am also afraid that the document being READ-ONLY 
will hinder further discussion, since nobody can comment on it.

For more context, our users have started looking for more flexible scheduling 
strategies. I hope this redesign helps.


was (Author: tison):
I like the proposal to treat scheduling as a separate component. It would help 
with further scheduling optimizations such as FLINK-10240.

To  [~zhuzh], [~tiemsn]
I think the first step of this redesign would be the extraction part; maybe 
you can take this document into consideration? Also, the link to FLINK-10240 
is broken; the correct one is 
https://docs.google.com/document/d/1zAseuBnqNXg3pst3vLBTc8yGOUo485J2LVWBAdFCW9I/
 . I am also afraid that the document being READ-ONLY will hinder further 
discussion, since nobody can comment on it.

For more context, our users have started looking for more flexible scheduling 
strategies. I hope this redesign helps.

> Redesign Flink Scheduling, introducing dedicated Scheduler component
> 
>
> Key: FLINK-10429
> URL: https://issues.apache.org/jira/browse/FLINK-10429
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
>
> This epic tracks the redesign of scheduling in Flink. Scheduling is currently 
> a concern that is scattered across different components, mainly the 
> ExecutionGraph/Execution and the SlotPool. Scheduling also happens only on 
> the granularity of individual tasks, which makes holistic scheduling 
> strategies hard to implement. In this epic we aim to introduce a dedicated 
> Scheduler component that can support use cases like auto-scaling, 
> local-recovery, and resource-optimized batch.
> The design for this feature is developed here: 
> https://docs.google.com/document/d/1q7NOqt05HIN-PlKEEPB36JiuU1Iu9fnxxVGJzylhsxU/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10429) Redesign Flink Scheduling, introducing dedicated Scheduler component

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629769#comment-16629769
 ] 

tison edited comment on FLINK-10429 at 9/27/18 5:18 AM:


I like the proposal to treat scheduling as a separate component. It would help 
with further scheduling optimizations such as FLINK-10240.

To  [~zhuzh], [~tiemsn]
I think the first step of this redesign would be the extraction part; maybe 
you can take this document into consideration? Also, the link to FLINK-10240 
is broken; the correct one is 
https://docs.google.com/document/d/1zAseuBnqNXg3pst3vLBTc8yGOUo485J2LVWBAdFCW9I/
 (without the "edit" part). I am also afraid that the document being READ-ONLY 
will hinder further discussion, since nobody can comment on it.

For more context, our users have started looking for more flexible scheduling 
strategies[1]. I hope this redesign helps.

[1] 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Scheduling-sources-td23344.html
 


was (Author: tison):
I like the proposal to treat scheduling as a separate component. It would help 
with further scheduling optimizations such as FLINK-10240.

To  [~zhuzh], [~tiemsn]
I think the first step of this redesign would be the extraction part; maybe 
you can take this document into consideration? Also, the link to FLINK-10240 
is broken; the correct one is 
https://docs.google.com/document/d/1zAseuBnqNXg3pst3vLBTc8yGOUo485J2LVWBAdFCW9I/
 (without the "edit" part). I am also afraid that the document being READ-ONLY 
will hinder further discussion, since nobody can comment on it.

For more context, our users have started looking for more flexible scheduling 
strategies. I hope this redesign helps.

> Redesign Flink Scheduling, introducing dedicated Scheduler component
> 
>
> Key: FLINK-10429
> URL: https://issues.apache.org/jira/browse/FLINK-10429
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
>
> This epic tracks the redesign of scheduling in Flink. Scheduling is currently 
> a concern that is scattered across different components, mainly the 
> ExecutionGraph/Execution and the SlotPool. Scheduling also happens only on 
> the granularity of individual tasks, which makes holistic scheduling 
> strategies hard to implement. In this epic we aim to introduce a dedicated 
> Scheduler component that can support use cases like auto-scaling, 
> local-recovery, and resource-optimized batch.
> The design for this feature is developed here: 
> https://docs.google.com/document/d/1q7NOqt05HIN-PlKEEPB36JiuU1Iu9fnxxVGJzylhsxU/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10429) Redesign Flink Scheduling, introducing dedicated Scheduler component

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629769#comment-16629769
 ] 

tison commented on FLINK-10429:
---

I like the proposal to treat scheduling as a separate component. It would help 
with further scheduling optimizations such as FLINK-10240.

To  [~zhuzh], [~tiemsn]
I think the first step of this redesign would be the extraction part; maybe 
you can take this document into consideration? Also, the link to FLINK-10240 
is broken; the correct one is 
https://docs.google.com/document/d/1zAseuBnqNXg3pst3vLBTc8yGOUo485J2LVWBAdFCW9I/
 . I am also afraid that the document being READ-ONLY will hinder further 
discussion, since nobody can comment on it.

For more context, our users have started looking for more flexible scheduling 
strategies. I hope this redesign helps.

> Redesign Flink Scheduling, introducing dedicated Scheduler component
> 
>
> Key: FLINK-10429
> URL: https://issues.apache.org/jira/browse/FLINK-10429
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
>
> This epic tracks the redesign of scheduling in Flink. Scheduling is currently 
> a concern that is scattered across different components, mainly the 
> ExecutionGraph/Execution and the SlotPool. Scheduling also happens only on 
> the granularity of individual tasks, which makes holistic scheduling 
> strategies hard to implement. In this epic we aim to introduce a dedicated 
> Scheduler component that can support use cases like auto-scaling, 
> local-recovery, and resource-optimized batch.
> The design for this feature is developed here: 
> https://docs.google.com/document/d/1q7NOqt05HIN-PlKEEPB36JiuU1Iu9fnxxVGJzylhsxU/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10426) Port TaskTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629199#comment-16629199
 ] 

tison commented on FLINK-10426:
---

This sub-task should consider testing that {{Task}} fails if blobs are missing.

> Port TaskTest to new code base
> --
>
> Key: FLINK-10426
> URL: https://issues.apache.org/jira/browse/FLINK-10426
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{TaskTest}} to new code base



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10427) Port JobSubmitTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629002#comment-16629002
 ] 

tison edited comment on FLINK-10427 at 9/26/18 5:47 PM:


{{testFailureWhenJarBlobsMissing}} should be ported to {{TaskTest#...}}, 
because FLIP-6 loads libraries on the TM, not the JM.

To clarify, in FLIP-6 the JM no longer fails a job because it failed to load a 
library, so the original test is not applicable. The porting can therefore 
only happen when {{TaskTest}} itself is ported. (Because {{TaskTest}} is a 
legacy test too, we cannot simply add a test case to it for now.)


was (Author: tison):
{{testFailureWhenJarBlobsMissing}} should be ported to {{TaskTest#...}}, 
because FLIP-6 loads libraries on the TM, not the JM.

> Port JobSubmitTest to new code base
> ---
>
> Key: FLINK-10427
> URL: https://issues.apache.org/jira/browse/FLINK-10427
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobSubmitTest}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10427) Port JobSubmitTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629130#comment-16629130
 ] 

tison commented on FLINK-10427:
---

{{testAnswerFailureWhenSavepointReadFails}} is covered by 
{{SavepointITCase#testSubmitWithUnknownSavepointPath}}

> Port JobSubmitTest to new code base
> ---
>
> Key: FLINK-10427
> URL: https://issues.apache.org/jira/browse/FLINK-10427
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobSubmitTest}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10427) Port JobSubmitTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629111#comment-16629111
 ] 

tison commented on FLINK-10427:
---

{{testFailureWhenInitializeOnMasterFails}} is covered by 
{{JobSubmissionFailsITCase#testExceptionInInitializeOnMaster}}

> Port JobSubmitTest to new code base
> ---
>
> Key: FLINK-10427
> URL: https://issues.apache.org/jira/browse/FLINK-10427
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobSubmitTest}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10427) Port JobSubmitTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629002#comment-16629002
 ] 

tison commented on FLINK-10427:
---

{{testFailureWhenJarBlobsMissing}} should be ported to {{TaskTest#...}}, 
because FLIP-6 loads libraries on the TM, not the JM.

> Port JobSubmitTest to new code base
> ---
>
> Key: FLINK-10427
> URL: https://issues.apache.org/jira/browse/FLINK-10427
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobSubmitTest}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10436) Example config uses deprecated key jobmanager.rpc.address

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628805#comment-16628805
 ] 

tison commented on FLINK-10436:
---

I can confirm that this is reproducible. From my side it is obviously a wart.

> Example config uses deprecated key jobmanager.rpc.address
> -
>
> Key: FLINK-10436
> URL: https://issues.apache.org/jira/browse/FLINK-10436
> Project: Flink
>  Issue Type: Bug
>  Components: Startup Shell Scripts
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: tison
>Priority: Major
>
> The example {{flink-conf.yaml}} shipped as part of the Flink distribution 
> (https://github.com/apache/flink/blob/master/flink-dist/src/main/resources/flink-conf.yaml)
>  has the following entry:
> {code}
> jobmanager.rpc.address: localhost
> {code}
> When using this key, the following deprecation warning is logged.
> {code}
> 2018-09-26 12:01:46,608 WARN  org.apache.flink.configuration.Configuration
>   - Config uses deprecated configuration key 
> 'jobmanager.rpc.address' instead of proper key 'rest.address'
> {code}
> The example config should not use deprecated config options.
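
A toy model of why the warning quoted above is logged: the value is looked up under the proper key first, and only if it is absent does the lookup fall back to a deprecated key and record a warning. The classes below are hypothetical and are not Flink's ConfigOption or Configuration; only the two key names and the wording of the warning are taken from the log line in the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are illustrative, not Flink's configuration classes.
final class DeprecatedKeyDemo {

    /** An option with one proper key and any number of deprecated fallback keys. */
    static final class Option {
        final String key;
        final String[] deprecatedKeys;

        Option(String key, String... deprecatedKeys) {
            this.key = key;
            this.deprecatedKeys = deprecatedKeys;
        }
    }

    // Collected instead of logged so the behavior is easy to inspect.
    static final List<String> warnings = new ArrayList<>();

    /** Looks up the proper key first, then the deprecated keys in order. */
    static String lookup(Map<String, String> conf, Option option) {
        String value = conf.get(option.key);
        if (value != null) {
            return value;
        }
        for (String old : option.deprecatedKeys) {
            value = conf.get(old);
            if (value != null) {
                warnings.add("Config uses deprecated configuration key '" + old
                        + "' instead of proper key '" + option.key + "'");
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("jobmanager.rpc.address", "localhost");
        Option restAddress = new Option("rest.address", "jobmanager.rpc.address");
        System.out.println(lookup(conf, restAddress));
        warnings.forEach(System.out::println);
    }
}
```

In this model, a config file that sets only the old key still resolves to a value, which is why the shipped example keeps working while producing the warning on every start.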



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10437) Some of keys under withDeprecatedKeys aren't marked as @depreacted

2018-09-26 Thread tison (JIRA)
tison created FLINK-10437:
-

 Summary: Some of keys under withDeprecatedKeys aren't marked as 
@depreacted
 Key: FLINK-10437
 URL: https://issues.apache.org/jira/browse/FLINK-10437
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.7.0
Reporter: tison


As the title says. For example, {{RestOptions#BIND_ADDRESS}} is declared with 
{{withDeprecatedKeys(WebOptions.ADDRESS.key())}}, but {{WebOptions.ADDRESS}} 
isn't marked as deprecated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10436) Example config uses deprecated key jobmanager.rpc.address

2018-09-26 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-10436:
-

Assignee: tison

> Example config uses deprecated key jobmanager.rpc.address
> -
>
> Key: FLINK-10436
> URL: https://issues.apache.org/jira/browse/FLINK-10436
> Project: Flink
>  Issue Type: Bug
>  Components: Startup Shell Scripts
>Affects Versions: 1.7.0
>Reporter: Ufuk Celebi
>Assignee: tison
>Priority: Major
>
> The example {{flink-conf.yaml}} shipped as part of the Flink distribution 
> (https://github.com/apache/flink/blob/master/flink-dist/src/main/resources/flink-conf.yaml)
>  has the following entry:
> {code}
> jobmanager.rpc.address: localhost
> {code}
> When using this key, the following deprecation warning is logged.
> {code}
> 2018-09-26 12:01:46,608 WARN  org.apache.flink.configuration.Configuration
>   - Config uses deprecated configuration key 
> 'jobmanager.rpc.address' instead of proper key 'rest.address'
> {code}
> The example config should not use deprecated config options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10406) Port JobManagerTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628709#comment-16628709
 ] 

tison commented on FLINK-10406:
---

{{testKvStateMessages}} is ported to 4 tests in {{JobMasterTest}}

{{JobMasterTest#testRequestKvStateWithoutRegistration}}
{{JobMasterTest#testRequestKvStateWithIrrelevantRegistration}}
{{JobMasterTest#testRegisterAndUnregisterKvState}}
{{JobMasterTest#testDuplicatedKvStateRegistrationsFailTask}}

> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10406) Port JobManagerTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628615#comment-16628615
 ] 

tison commented on FLINK-10406:
---

{{testResourceManagerConnection}} is not applicable to the FLIP-6 code base. 
FLIP-6 has its own reconnect logic between the JM and the RM, and that 
mechanism should be guarded by

{{JobMasterTest#testReconnectionAfterDisconnect}}
{{JobMasterTest#testResourceManagerConnectionAfterRegainingLeadership}}
{{JobMasterTest#testCloseUnestablishedResourceManagerConnection}}

> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10406) Port JobManagerTest to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628594#comment-16628594
 ] 

tison commented on FLINK-10406:
---

- {{testSavepointWithDeactivatedPeriodicCheckpointing}} is ported to 
{{JobMasterTriggerSavepointIT#testStopJobAfterSavepointWithDeactivatedPeriodicCheckpointing}},
 with a small refactoring that lets the latter test class configure 
checkpointInterval (to deactivate periodic checkpointing).

> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Commented] (FLINK-10405) Port JobManagerFailsITCase to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628335#comment-16628335
 ] 

tison commented on FLINK-10405:
---

cc [~trohrm...@apache.org]: was {{DispatcherProcess}} introduced for the 
purpose I stated above?

> Port JobManagerFailsITCase to new code base
> ---
>
> Key: FLINK-10405
> URL: https://issues.apache.org/jira/browse/FLINK-10405
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerFailsITCase}} to new code base.





[jira] [Commented] (FLINK-10405) Port JobManagerFailsITCase to new code base

2018-09-26 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628331#comment-16628331
 ] 

tison commented on FLINK-10405:
---

The porting work would benefit from FLINK-10403 
https://github.com/apache/flink/pull/6751/commits/df034e1ca192a74962da2792deef7b3c78de047c

It introduces a test utility, {{DispatcherProcess}}, which, on JM failure, can 
start a new one that takes over. IIRC {{MiniCluster}} does not provide such a 
feature.

> Port JobManagerFailsITCase to new code base
> ---
>
> Key: FLINK-10405
> URL: https://issues.apache.org/jira/browse/FLINK-10405
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerFailsITCase}} to new code base.





[jira] [Commented] (FLINK-10392) Remove legacy mode

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628234#comment-16628234
 ] 

tison commented on FLINK-10392:
---

[~till.rohrmann] I am afraid that if we add a sub-task for every test porting 
job, we will end up with a long list even before we start removing the legacy 
production code. Is that acceptable, or can we squash some of them?

> Remove legacy mode
> --
>
> Key: FLINK-10392
> URL: https://issues.apache.org/jira/browse/FLINK-10392
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This issue is the umbrella issue to remove the legacy mode code from Flink.





[jira] [Comment Edited] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628197#comment-16628197
 ] 

tison edited comment on FLINK-10406 at 9/26/18 3:41 AM:


- {{testCancelWithSavepoint}} is covered by 
{{JobMasterTriggerSavepointIT#testStopJobAfterSavepoint}}

- {{testCancelWithSavepointNoDirectoriesConfigured}} is to some extent covered 
by {{JobMasterTriggerSavepointIT#testDoNotCancelJobIfSavepointFails}}. 
Currently we do not provide a detailed error message from which the cause of a 
savepoint failure could be determined. {{testDoNotCancelJobIfSavepointFails}} 
tests the case where permission to the savepoint path is denied, but changing 
the path to a /not/exist/path exercises the same process.

The exception is stringified as "java.util.concurrent.ExecutionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to 
trigger savepoint. Decline reason: An Exception occurred while triggering the 
checkpoint."

- {{testCancelJobWithSavepointFailurePeriodicCheckpoints}} is covered by 
{{JobMasterTriggerSavepointIT#testDoNotCancelJobIfSavepointFails}}.
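
For illustration only: the nested wrappers in the stringified exception above can be peeled off to reach the underlying failure. The sketch below relies only on the standard {{java.util.concurrent}} wrapper types; {{CauseChainSketch}} and {{stripWrappers}} are hypothetical names, not Flink utilities, and nothing here claims to be how the test or Flink itself inspects the error.

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

/** Hypothetical helper, not part of Flink. */
final class CauseChainSketch {

    /**
     * Walks down the cause chain while the current throwable is only a
     * future-related wrapper, and returns the first non-wrapper throwable.
     */
    static Throwable stripWrappers(Throwable t) {
        Throwable current = t;
        while ((current instanceof ExecutionException
                        || current instanceof CompletionException)
                && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }
}
```

With such a helper a test could assert on the type or message of the innermost exception instead of on the outer {{ExecutionException}}.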


> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Comment Edited] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628197#comment-16628197
 ] 

tison edited comment on FLINK-10406 at 9/26/18 3:27 AM:


- {{testCancelWithSavepoint}} is covered by 
{{JobMasterTriggerSavepointIT#testStopJobAfterSavepoint}}

- {{testCancelWithSavepointNoDirectoriesConfigured}} is to some extent covered 
by {{JobMasterTriggerSavepointIT#testDoNotCancelJobIfSavepointFails}}. 
Currently we do not provide a detailed error message from which the cause of a 
savepoint failure could be determined. {{testDoNotCancelJobIfSavepointFails}} 
tests the case where permission to the savepoint path is denied, but changing 
the path to a /not/exist/path exercises the same process.

The exception is stringified as "java.util.concurrent.ExecutionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to 
trigger savepoint. Decline reason: An Exception occurred while triggering the 
checkpoint."


> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Comment Edited] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628197#comment-16628197
 ] 

tison edited comment on FLINK-10406 at 9/26/18 3:26 AM:


- {{testCancelWithSavepoint}} is covered by 
{{JobMasterTriggerSavepointIT#testStopJobAfterSavepoint}}

- {{testCancelWithSavepointNoDirectoriesConfigured}} is to some extent covered 
by {{JobMasterTriggerSavepointIT#testDoNotCancelJobIfSavepointFails}}. 
Currently we do not provide a detailed error message from which the cause of a 
savepoint failure could be determined. {{testDoNotCancelJobIfSavepointFails}} 
tests the case where permission to the savepoint path is denied, but changing 
the path to a /not/exist/path exercises the same process.


> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Commented] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628197#comment-16628197
 ] 

tison commented on FLINK-10406:
---

- {{testCancelWithSavepoint}} is covered by 
{{JobMasterTriggerSavepointIT#testStopJobAfterSavepoint}}

> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Created] (FLINK-10427) Port JobSubmitTest to new code base

2018-09-25 Thread tison (JIRA)
tison created FLINK-10427:
-

 Summary: Port JobSubmitTest to new code base
 Key: FLINK-10427
 URL: https://issues.apache.org/jira/browse/FLINK-10427
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.7.0
Reporter: tison
Assignee: tison
 Fix For: 1.7.0


Port {{JobSubmitTest}} to new code base.





[jira] [Comment Edited] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628148#comment-16628148
 ] 

tison edited comment on FLINK-10406 at 9/26/18 2:57 AM:


- {{testSavepointRestoreSettings}} is covered by 
{{JobMaster#testRestoringFromSavepoint}}

The {{triggerSavepoint}} part is covered by {{JobMasterTriggerSavepointIT}}, 
and the submit-failure part should be taken care of when porting 
{{JobSubmitTest}}, which has a test {{testAnswerFailureWhenSavepointReadFails}}.


> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Commented] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628148#comment-16628148
 ] 

tison commented on FLINK-10406:
---

- {{testSavepointRestoreSettings}} is covered by 
{{JobMaster#testRestoringFromSavepoint}}

> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Comment Edited] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627806#comment-16627806
 ] 

tison edited comment on FLINK-10406 at 9/25/18 7:28 PM:


* {{testStopSignal}} and {{testStopSignalFail}} are covered by 
{{ExecutionGraphStopTest}}. The high-level invocations at the {{Dispatcher}} 
and {{JobMaster}} level are trivial.
 * {{testNullHostnameGoesToLocalhost}} is ported to {{AkkaUtilsTest#"null 
hostname should go to localhost"}}
 * {{testRequestPartitionState*}}: I would propose to ignore all of them given 
FLINK-10319, which proposes to disable {{JobMaster#requestPartitionState}} and 
has one approval and no objections so far. Also cc [~trohrm...@apache.org], 
could you take a look at FLINK-10319 so that we can decide on this removal? 
(UPDATE: even if FLINK-10319 is not accepted, these tests should be covered by 
{{JobMasterTest#testRequestPartitionState}} and {{TaskTest#...}})


> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Comment Edited] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627806#comment-16627806
 ] 

tison edited comment on FLINK-10406 at 9/25/18 7:22 PM:


* {{testStopSignal}} and {{testStopSignalFail}} are covered by 
{{ExecutionGraphStopTest}}. The high-level invocations at the {{Dispatcher}} 
and {{JobMaster}} level are trivial.
 * {{testNullHostnameGoesToLocalhost}} is ported to {{AkkaUtilsTest#"null 
hostname should go to localhost"}}
 * {{testRequestPartitionState*}}: I would propose to ignore all of them given 
FLINK-10319, which proposes to disable {{JobMaster#requestPartitionState}} and 
has one approval and no objections so far. Also cc [~trohrm...@apache.org], 
could you take a look at FLINK-10319 so that we can decide on this removal?


> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Comment Edited] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627806#comment-16627806
 ] 

tison edited comment on FLINK-10406 at 9/25/18 7:13 PM:


* {{testStopSignal}} and {{testStopSignalFail}} are covered by 
{{ExecutionGraphStopTest}}. The high-level invocations at the {{Dispatcher}} 
and {{JobMaster}} level are trivial.
 * {{testNullHostnameGoesToLocalhost}} is ported to {{AkkaUtilsTest#"null 
hostname should go to localhost"}}


> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Commented] (FLINK-10406) Port JobManagerTest to new code base

2018-09-25 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627806#comment-16627806
 ] 

tison commented on FLINK-10406:
---

* {{testStopSignal}} and {{testStopSignalFail}} are covered by 
{{ExecutionGraphStopTest}}. The high-level invocations at the {{Dispatcher}} 
and {{JobMaster}} level are trivial.
 *

> Port JobManagerTest to new code base
> 
>
> Key: FLINK-10406
> URL: https://issues.apache.org/jira/browse/FLINK-10406
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> Port {{JobManagerTest}} to new code base
> Not all of its tests should be ported, since some of them are covered by 
> {{JobMasterTest}}.





[jira] [Issue Comment Deleted] (FLINK-10240) Pluggable scheduling strategy for batch jobs

2018-09-25 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10240:
--
Comment: was deleted

(was: Introducing a pluggable scheduling strategy is an excellent idea that 
could greatly expand the range of cases Flink is able to handle.

I like this idea and can help. However, the document attached above is 
read-only, so I have left my comments in a copy of it, linked below. Most of 
them are layout improvements and minor rewording; the body of the document does 
not go beyond the original design.

https://docs.google.com/document/d/15pUYc5_yrY2IwmnADCoNWZwOIYCOcroWmuuZHs-vdlU/edit?usp=sharing

Note that this is an EDITABLE document, and everyone interested in it can leave 
comments or edit it directly. As an open source project we simply trust our 
contributors; the document can be frozen and left comment-only once the 
discussion reaches a consensus.)

> Pluggable scheduling strategy for batch jobs
> 
>
> Key: FLINK-10240
> URL: https://issues.apache.org/jira/browse/FLINK-10240
> Project: Flink
>  Issue Type: New Feature
>  Components: Distributed Coordination
>Reporter: Zhu Zhu
>Priority: Major
>  Labels: scheduling
>
> Currently batch jobs are scheduled with the LAZY_FROM_SOURCES strategy: source 
> tasks are scheduled in the beginning, and other tasks are scheduled once 
> their input data are consumable.
> However, input data being consumable does not always mean the task can start 
> working at once. 
>  
> One example is the hash join operation, where the operator first consumes one 
> side (we call it the build side) to set up a table, then consumes the other 
> side (we call it the probe side) to do the real join work. If the probe side 
> is started early, it just gets stuck on back pressure, as the join operator 
> will not consume data from it before the build stage is done, causing a waste 
> of resources.
> If we have the probe side task started after the build stage is done, both 
> the build and probe sides can have more computing resources as they are 
> staggered.
>  
> That's why we think a flexible scheduling strategy is needed, allowing job 
> owners to customize the vertex schedule order and constraints. Better 
> resource utilization usually means better performance.





[jira] [Updated] (FLINK-10386) Remove legacy class TaskExecutionStateListener

2018-09-25 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10386:
--
Issue Type: Sub-task  (was: Improvement)
Parent: FLINK-10392

> Remove legacy class TaskExecutionStateListener
> --
>
> Key: FLINK-10386
> URL: https://issues.apache.org/jira/browse/FLINK-10386
> Project: Flink
>  Issue Type: Sub-task
>  Components: TaskManager
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> After a discussion 
> [here|https://github.com/apache/flink/commit/0735b5b935b0c0757943e2d58047afcfb9949560#commitcomment-30584257]
>  with [~trohrm...@apache.org], I started to analyze the usage of 
> {{ActorGatewayTaskExecutionStateListener}} and {{TaskExecutionStateListener}}.
> In conclusion, we have abandoned the {{TaskExecutionStateListener}} strategy 
> and no component relies on it any more. Instead, we introduced 
> {{TaskManagerActions}} to handle the communication of {{Task}} with 
> {{TaskManager}}. No one except {{TaskManager}} should communicate directly 
> with {{Task}}, so the legacy class {{TaskExecutionStateListener}} can be 
> safely removed.





[jira] [Created] (FLINK-10426) Port TaskTest to new code base

2018-09-25 Thread tison (JIRA)
tison created FLINK-10426:
-

 Summary: Port TaskTest to new code base
 Key: FLINK-10426
 URL: https://issues.apache.org/jira/browse/FLINK-10426
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.7.0
Reporter: tison
Assignee: tison
 Fix For: 1.7.0


Port {{TaskTest}} to new code base





[jira] [Commented] (FLINK-10413) requestPartitionState messages overwhelms JM RPC main thread

2018-09-24 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626789#comment-16626789
 ] 

tison commented on FLINK-10413:
---

This might be a duplicate of FLINK-10319

> requestPartitionState messages overwhelms JM RPC main thread
> 
>
> Key: FLINK-10413
> URL: https://issues.apache.org/jira/browse/FLINK-10413
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Zhu Zhu
>Assignee: vinoyang
>Priority: Major
>
> We tried to benchmark the job scheduling performance with a 2000x2000 
> ALL-to-ALL streaming (EAGER) job. The input data is empty, so the tasks 
> finish soon after they start.
> In this case we see slow RPC responses, and TM/RM heartbeats to the JM 
> eventually time out.
> We find ~2,000,000 requestPartitionState messages triggered by 
> triggerPartitionProducerStateCheck in a short time, which overwhelm the JM RPC 
> main thread. This is because downstream tasks can be started earlier than 
> upstream tasks in EAGER scheduling.
>  
> We'd suggest dropping the partition producer state check to avoid this issue. 
> The task can just keep waiting for a while and retrying if the partition does 
> not exist. There are two cases in which the partition does not exist:
>  # the partition is not started yet
>  # the partition has failed
> In case 1, retrying works. In case 2, a task failover will soon happen and 
> cancel the downstream tasks as well.
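
The wait-and-retry idea in the description above could look roughly like the following. This is a minimal sketch under stated assumptions, not Flink code: {{PartitionRetrySketch}}, {{awaitPartition}}, the {{partitionExists}} probe, and the fixed backoff are all hypothetical, and a real task would also have to react to cancellation triggered by a failover.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of "keep waiting and retry", not part of Flink. */
final class PartitionRetrySketch {

    /**
     * Polls the given probe until it reports that the partition exists or
     * the attempts are used up. No message is sent to the JobManager here;
     * the consumer only waits locally between attempts.
     *
     * @return true if the partition became available, false otherwise
     */
    static boolean awaitPartition(
            BooleanSupplier partitionExists, int maxAttempts, long backoffMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (partitionExists.getAsBoolean()) {
                return true; // case 1: the producer has started in the meantime
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoffMillis);
            }
        }
        // Case 2 (producer failed): give up locally; in the proposal a task
        // failover would cancel this consumer anyway.
        return false;
    }
}
```

The design point is that the retry stays on the consumer side, so the number of messages reaching the JM RPC main thread does not grow with the number of not-yet-started producers.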





[jira] [Assigned] (FLINK-10251) Handle oversized response messages in AkkaRpcActor

2018-09-24 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-10251:
-

Assignee: (was: tison)

> Handle oversized response messages in AkkaRpcActor
> --
>
> Key: FLINK-10251
> URL: https://issues.apache.org/jira/browse/FLINK-10251
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{AkkaRpcActor}} should check whether an RPC response which is sent to a 
> remote sender does not exceed the maximum framesize of the underlying 
> {{ActorSystem}}. If this is the case we should fail fast instead. We can 
> achieve this by serializing the response and sending the serialized byte 
> array.
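
A rough sketch of the fail-fast check described above, under stated assumptions: it uses plain Java serialization as a stand-in for whatever serializer the RPC layer actually uses, and {{FrameSizeCheckSketch}}, {{serializeOrFail}} and {{maxFrameSizeBytes}} are hypothetical names, not part of {{AkkaRpcActor}}.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical sketch, not part of Flink. */
final class FrameSizeCheckSketch {

    /**
     * Serializes the response up front and fails fast if the payload would
     * not fit into a single frame. The returned bytes are what would be sent.
     */
    static byte[] serializeOrFail(Serializable response, long maxFrameSizeBytes)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(response);
        }
        byte[] payload = bytes.toByteArray();
        if (payload.length > maxFrameSizeBytes) {
            throw new IOException(
                    "Serialized response of " + payload.length
                            + " bytes exceeds the maximum frame size of "
                            + maxFrameSizeBytes + " bytes");
        }
        return payload;
    }
}
```

Serializing once and sending the resulting byte array means the size is known before anything goes over the wire, so the sender can report a meaningful error instead of having the message dropped.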





[jira] [Created] (FLINK-10406) Port JobManagerTest to new code base

2018-09-24 Thread tison (JIRA)
tison created FLINK-10406:
-

 Summary: Port JobManagerTest to new code base
 Key: FLINK-10406
 URL: https://issues.apache.org/jira/browse/FLINK-10406
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.7.0
Reporter: tison
Assignee: tison
 Fix For: 1.7.0


Port {{JobManagerTest}} to new code base

Not all of its tests should be ported, since some of them are covered by 
{{JobMasterTest}}.





[jira] [Created] (FLINK-10405) Port JobManagerFailsITCase to new code base

2018-09-24 Thread tison (JIRA)
tison created FLINK-10405:
-

 Summary: Port JobManagerFailsITCase to new code base
 Key: FLINK-10405
 URL: https://issues.apache.org/jira/browse/FLINK-10405
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.7.0
Reporter: tison
Assignee: tison
 Fix For: 1.7.0


Port {{JobManagerFailsITCase}} to new code base.





[jira] [Closed] (FLINK-10256) Port legacy jobmanager test to FILP-6

2018-09-23 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison closed FLINK-10256.
-
  Resolution: Duplicate
Release Note: The purpose of this issue would be covered by FLINK-10392

> Port legacy jobmanager test to FILP-6
> -
>
> Key: FLINK-10256
> URL: https://issues.apache.org/jira/browse/FLINK-10256
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: tison
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0
>
>
> I am planning to rework JobManagerFailsITCase and JobManagerTest into 
> JobMasterITCase and JobMasterHAITCase. That is, reorganize the legacy tests, 
> make them neat and cover cases explicitly. The PR would follow before this 
> weekend.
> While reworking, I'd like to add more jm failover test cases, listed below, 
> for the further implementation of jm failover with the RECONCILING state. By 
> "jm failover" I mean a real-world failover (like low power or process exit), 
> without calling Flink's internal postStop logic or anything like it.
> 1. Streaming task with jm failover.
> 2. Streaming task with jm failover concurrent to task fail.
> 3. Batch task with jm failover.
> 4. Batch task with jm failover concurrent to task fail.
> 5. Batch task with jm failover when some vertex has already been FINISHED.





[jira] [Commented] (FLINK-10400) Return failed JobResult if job terminates in state FAILED or CANCELED

2018-09-23 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625030#comment-16625030
 ] 

tison commented on FLINK-10400:
---

Agreed. It is a code wart that should be fixed.
To be clearer: return a {{JobResult}} with an {{Exception}} as described, and 
{{addSuppressed}} the failure cause if there is one.
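
As a sketch of what the comment above suggests, assuming nothing about the real {{JobResult}} API: the types below ({{JobResultSketch}}, {{TerminalState}}, {{JobTerminatedException}}) are hypothetical stand-ins defined only for this example. The only real API used is {{Throwable#addSuppressed}}.

```java
/** Hypothetical stand-ins; these are not Flink's JobResult or exception classes. */
final class JobResultSketch {

    enum TerminalState { FINISHED, FAILED, CANCELED }

    static final class JobTerminatedException extends Exception {
        JobTerminatedException(String message) {
            super(message);
        }
    }

    /**
     * Returns null for a successful job. For FAILED or CANCELED it always
     * returns an exception, and attaches the failure cause (if any) as a
     * suppressed exception so that it is not lost.
     */
    static Throwable toFailure(TerminalState state, Throwable failureCause) {
        if (state == TerminalState.FINISHED) {
            return null;
        }
        JobTerminatedException e =
                new JobTerminatedException("Job terminated in state " + state);
        if (failureCause != null) {
            e.addSuppressed(failureCause);
        }
        return e;
    }
}
```

With this shape a {{CANCELED}} job without a recorded failure cause still yields a non-successful result, which is the case the issue is about.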

> Return failed JobResult if job terminates in state FAILED or CANCELED
> -
>
> Key: FLINK-10400
> URL: https://issues.apache.org/jira/browse/FLINK-10400
> Project: Flink
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 1.6.1, 1.7.0, 1.5.4
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> If the job reaches the globally terminal state {{FAILED}} or {{CANCELED}}, 
> the {{JobResult}} must return a non-successful result. At the moment, it can 
> happen that in the {{CANCELED}} state where we don't find a failure cause 
> that we return a successful {{JobResult}}.
> In order to change this I propose to always return a {{JobResult}} with a 
> {{JobCancellationException}} in case of {{CANCELED}} and a 
> {{JobExecutionException}} in case of {{FAILED}}.





[jira] [Created] (FLINK-10399) Refractor ParameterTool#fromArgs

2018-09-23 Thread tison (JIRA)
tison created FLINK-10399:
-

 Summary: Refractor ParameterTool#fromArgs
 Key: FLINK-10399
 URL: https://issues.apache.org/jira/browse/FLINK-10399
 Project: Flink
  Issue Type: Improvement
  Components: Client
Affects Versions: 1.7.0
Reporter: tison
Assignee: tison
 Fix For: 1.7.0


{{ParameterTool#fromArgs}} uses an odd implementation that Flink developers 
cannot read quickly.
The main problem is that, when parsing args, we always try to get a key-value 
pair, but the implementation iterates with a {{for}} loop, which introduces 
awkward flags, mutable variables and branches.
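
One possible shape for such a refactoring is to advance an explicit cursor and consume one key (plus an optional value) per iteration. This is a minimal sketch, not the actual patch: {{ArgsSketch}}, {{parse}} and the {{NO_VALUE}} placeholder are hypothetical, and the rules it applies (keys start with {{-}} or {{--}}, every token starting with {{-}} is treated as a key, so negative numbers are not handled) are simplifying assumptions rather than the behaviour of {{ParameterTool}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: consume arguments as key/value pairs with an explicit cursor. */
final class ArgsSketch {

    /** Placeholder stored for keys without a value; chosen for this sketch only. */
    static final String NO_VALUE = "__NO_VALUE__";

    static Map<String, String> parse(String[] args) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        while (i < args.length) {
            // Each iteration starts at a key.
            String key = stripPrefix(args[i]);
            i++;
            // The next token is this key's value unless it looks like another key.
            if (i < args.length && !isKey(args[i])) {
                result.put(key, args[i]);
                i++;
            } else {
                result.put(key, NO_VALUE);
            }
        }
        return result;
    }

    private static boolean isKey(String token) {
        return token.startsWith("-");
    }

    private static String stripPrefix(String token) {
        if (token.startsWith("--")) {
            return token.substring(2);
        }
        if (token.startsWith("-")) {
            return token.substring(1);
        }
        throw new IllegalArgumentException(
                "Expected a key starting with '-' or '--' but found: " + token);
    }
}
```

Because the loop body always handles a complete pair, no flag is needed to remember whether the previous token was a key.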





[jira] [Assigned] (FLINK-10251) Handle oversized response messages in AkkaRpcActor

2018-09-23 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison reassigned FLINK-10251:
-

Assignee: tison

> Handle oversized response messages in AkkaRpcActor
> --
>
> Key: FLINK-10251
> URL: https://issues.apache.org/jira/browse/FLINK-10251
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: tison
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{AkkaRpcActor}} should check whether an RPC response which is sent to a 
> remote sender does not exceed the maximum framesize of the underlying 
> {{ActorSystem}}. If this is the case we should fail fast instead. We can 
> achieve this by serializing the response and sending the serialized byte 
> array.





[jira] [Comment Edited] (FLINK-10397) Remove CoreOptions#MODE

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624925#comment-16624925
 ] 

tison edited comment on FLINK-10397 at 9/23/18 3:52 AM:


This sub-task depends on the other sub-tasks, especially FLINK-10396, since 
FLINK-10396 is about the test scope and this one is about the whole project. 
It would be better to send a pull request/patch based on the FLINK-10396 
changes.

To clarify, I say "depends on", but the point is that this one conceptually 
covers FLINK-10396.


> Remove CoreOptions#MODE
> ---
>
> Key: FLINK-10397
> URL: https://issues.apache.org/jira/browse/FLINK-10397
> Project: Flink
>  Issue Type: Sub-task
>  Components: Configuration
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> Remove the {{CoreOptions#MODE}} since it is no longer needed.





[jira] [Comment Edited] (FLINK-10396) Remove codebase switch from MiniClusterResource

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624924#comment-16624924
 ] 

tison edited comment on FLINK-10396 at 9/23/18 3:31 AM:


The attached patch disables all switches and removes nearly all of the 
branches, except those under yarn-test. I left those because I think removing 
them would lead to the complete removal of some tests such as 
{{YARNHighAvailabilityITCase}}, so I set {{isNewMode}} to a final value of 
{{true}} as a hint for the follow-up.


was (Author: tison):
Disable all switches and remove nearly all of the branches, except those under 
yarn-test. I left those because I think removing them would lead to the 
complete removal of some tests such as {{YARNHighAvailabilityITCase}}, so I set 
{{isNewMode}} to a final value of {{true}} as a hint for the follow-up.

> Remove codebase switch from MiniClusterResource
> ---
>
> Key: FLINK-10396
> URL: https://issues.apache.org/jira/browse/FLINK-10396
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: 
> 0001-FLINK-10396-Remove-codebase-switch-in-UT-IT-tests.patch
>
>
> Remove the legacy codebase switch from {{MiniClusterResource}}.





[jira] [Commented] (FLINK-10397) Remove CoreOptions#MODE

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624925#comment-16624925
 ] 

tison commented on FLINK-10397:
---

This sub-task depends on the other sub-tasks, most heavily on 
FLINK-10396, since FLINK-10396 covers the test scope and this one covers the 
whole project. It would be better to send a pull request/patch based on the 
FLINK-10396 changes.

> Remove CoreOptions#MODE
> ---
>
> Key: FLINK-10397
> URL: https://issues.apache.org/jira/browse/FLINK-10397
> Project: Flink
>  Issue Type: Sub-task
>  Components: Configuration
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> Remove the {{CoreOptions#MODE}} since it is no longer needed.





[jira] [Comment Edited] (FLINK-10396) Remove codebase switch from MiniClusterResource

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624924#comment-16624924
 ] 

tison edited comment on FLINK-10396 at 9/23/18 3:26 AM:


Disable all switches and remove nearly all of the branches, except those under 
yarn-test. I left those because I think removing them would lead to the 
complete removal of some tests such as {{YARNHighAvailabilityITCase}}, so I set 
{{isNewMode}} to a final value of {{true}} as a hint for the follow-up.


was (Author: tison):
Disable all switches and remove nearly all of the branches, except those under 
yarn-test. I left those because I think removing them would lead to the 
complete removal of some tests such as {{YARNHighAvailabilityITCase}}.

> Remove codebase switch from MiniClusterResource
> ---
>
> Key: FLINK-10396
> URL: https://issues.apache.org/jira/browse/FLINK-10396
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: 
> 0001-FLINK-10396-Remove-codebase-switch-in-UT-IT-tests.patch
>
>
> Remove the legacy codebase switch from {{MiniClusterResource}}.





[jira] [Commented] (FLINK-10396) Remove codebase switch from MiniClusterResource

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624924#comment-16624924
 ] 

tison commented on FLINK-10396:
---

Disable all switches and remove nearly all of the branches, except those under 
yarn-test. I left those because I think removing them would lead to the 
complete removal of some tests such as {{YARNHighAvailabilityITCase}}.

> Remove codebase switch from MiniClusterResource
> ---
>
> Key: FLINK-10396
> URL: https://issues.apache.org/jira/browse/FLINK-10396
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: 
> 0001-FLINK-10396-Remove-codebase-switch-in-UT-IT-tests.patch
>
>
> Remove the legacy codebase switch from {{MiniClusterResource}}.





[jira] [Updated] (FLINK-10396) Remove codebase switch from MiniClusterResource

2018-09-22 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10396:
--
Attachment: 0001-FLINK-10396-Remove-codebase-switch-in-UT-IT-tests.patch

> Remove codebase switch from MiniClusterResource
> ---
>
> Key: FLINK-10396
> URL: https://issues.apache.org/jira/browse/FLINK-10396
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: 
> 0001-FLINK-10396-Remove-codebase-switch-in-UT-IT-tests.patch
>
>
> Remove the legacy codebase switch from {{MiniClusterResource}}.





[jira] [Commented] (FLINK-10396) Remove codebase switch from MiniClusterResource

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624921#comment-16624921
 ] 

tison commented on FLINK-10396:
---

Besides the tests based on {{MiniClusterResource}}, we have legacy mode tests 
in {{YarnTestBase}}, {{ScalaShellITCase}} and {{ScalaShellLocalStartupITCase}}. 
Once the umbrella issue is resolved, all of them will be invalid. Thus I 
propose renaming this issue to "Remove codebase switch in UT/IT tests".
All of the switches depend on {{TestBaseUtils}}, so we can do the 
removal from there.

> Remove codebase switch from MiniClusterResource
> ---
>
> Key: FLINK-10396
> URL: https://issues.apache.org/jira/browse/FLINK-10396
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> Remove the legacy codebase switch from {{MiniClusterResource}}.





[jira] [Comment Edited] (FLINK-10392) Remove legacy mode

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624906#comment-16624906
 ] 

tison edited comment on FLINK-10392 at 9/23/18 2:57 AM:


Thank you [~till.rohrmann] for kicking off this thread! Mixing the FLIP-6 new 
mode with the legacy mode, which we will no longer support, would confuse our 
customers as well as contributors. So I strongly agree with the removal.

I think there is something I could help with. Since you have taken the 
umbrella issue as well as all sub-tasks, I will provide patches which you 
might make use of. I could take it over if you permit.


was (Author: tison):
Thank you [~till.rohrmann] for kicking off this thread! Mixing the FLIP-6 new 
mode with the legacy mode, which we will no longer support, would confuse our 
customers as well as contributors. So I strongly agree with the removal.

I think there is something I could help with. Since you have taken the 
umbrella issue as well as all sub-tasks, I will provide patches which you 
might make use of.

> Remove legacy mode
> --
>
> Key: FLINK-10392
> URL: https://issues.apache.org/jira/browse/FLINK-10392
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This issue is the umbrella issue to remove the legacy mode code from Flink.





[jira] [Updated] (FLINK-10395) Remove legacy mode switch from parent pom

2018-09-22 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10395:
--
Attachment: 0001-FLINK-10395-Remove-legacy-mode-switch-from-parent-po.patch

> Remove legacy mode switch from parent pom
> -
>
> Key: FLINK-10395
> URL: https://issues.apache.org/jira/browse/FLINK-10395
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: 
> 0001-FLINK-10395-Remove-legacy-mode-switch-from-parent-po.patch
>
>
> Remove the legacy mode switch from the parent {{pom.xml}}.





[jira] [Updated] (FLINK-10394) Remove legacy mode testing profiles from Travis config

2018-09-22 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10394:
--
Attachment: 0001-FLINK-10394-Remove-legacy-mode-testing-profiles-from.patch

> Remove legacy mode testing profiles from Travis config
> --
>
> Key: FLINK-10394
> URL: https://issues.apache.org/jira/browse/FLINK-10394
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: 
> 0001-FLINK-10394-Remove-legacy-mode-testing-profiles-from.patch
>
>
> Remove the legacy mode testing profiles from Travis config.





[jira] [Updated] (FLINK-10393) Remove legacy entrypoints from startup scripts

2018-09-22 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10393:
--
Attachment: 0001-FLINK-10393-Remove-legacy-entrypoints-from-startup-s.patch

> Remove legacy entrypoints from startup scripts
> --
>
> Key: FLINK-10393
> URL: https://issues.apache.org/jira/browse/FLINK-10393
> Project: Flink
>  Issue Type: Sub-task
>  Components: Startup Shell Scripts
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: 
> 0001-FLINK-10393-Remove-legacy-entrypoints-from-startup-s.patch
>
>
> Remove the legacy entrypoints from the startup scripts.





[jira] [Updated] (FLINK-10393) Remove legacy entrypoints from startup scripts

2018-09-22 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10393:
--
Attachment: (was: remove-script.patch)

> Remove legacy entrypoints from startup scripts
> --
>
> Key: FLINK-10393
> URL: https://issues.apache.org/jira/browse/FLINK-10393
> Project: Flink
>  Issue Type: Sub-task
>  Components: Startup Shell Scripts
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> Remove the legacy entrypoints from the startup scripts.





[jira] [Updated] (FLINK-10393) Remove legacy entrypoints from startup scripts

2018-09-22 Thread tison (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tison updated FLINK-10393:
--
Attachment: remove-script.patch

> Remove legacy entrypoints from startup scripts
> --
>
> Key: FLINK-10393
> URL: https://issues.apache.org/jira/browse/FLINK-10393
> Project: Flink
>  Issue Type: Sub-task
>  Components: Startup Shell Scripts
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: remove-script.patch
>
>
> Remove the legacy entrypoints from the startup scripts.





[jira] [Commented] (FLINK-10392) Remove legacy mode

2018-09-22 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624906#comment-16624906
 ] 

tison commented on FLINK-10392:
---

Thank you [~till.rohrmann] for kicking off this thread! Mixing the FLIP-6 new 
mode with the legacy mode, which we will no longer support, would confuse our 
customers as well as contributors. So I strongly agree with the removal.

I think there is something I could help with. Since you have taken the 
umbrella issue as well as all sub-tasks, I will provide patches which you 
might make use of.

> Remove legacy mode
> --
>
> Key: FLINK-10392
> URL: https://issues.apache.org/jira/browse/FLINK-10392
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0
>
>
> This issue is the umbrella issue to remove the legacy mode code from Flink.





[jira] [Comment Edited] (FLINK-10388) RestClientTest sometimes fails with AssertionError

2018-09-21 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623945#comment-16623945
 ] 

tison edited comment on FLINK-10388 at 9/21/18 5:49 PM:


Maybe relevant to FLINK-4052

cc [~StephanEwen]


was (Author: tison):
Maybe relevant to FLINK-4052

> RestClientTest sometimes fails with AssertionError
> --
>
> Key: FLINK-10388
> URL: https://issues.apache.org/jira/browse/FLINK-10388
> Project: Flink
>  Issue Type: Test
>Reporter: Ted Yu
>Priority: Minor
>
> Running the test on Linux I got:
> {code}
> testConnectionTimeout(org.apache.flink.runtime.rest.RestClientTest)  Time 
> elapsed: 1.918 sec  <<< FAILURE!
> java.lang.AssertionError:
> Expected: an instance of 
> org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException
>  but: 
>   Network is unreachable: /10.255.255.1:80> is a 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedSocketException
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>   at org.junit.Assert.assertThat(Assert.java:956)
>   at org.junit.Assert.assertThat(Assert.java:923)
>   at 
> org.apache.flink.runtime.rest.RestClientTest.testConnectionTimeout(RestClientTest.java:69)
> {code}





[jira] [Commented] (FLINK-10388) RestClientTest sometimes fails with AssertionError

2018-09-21 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623945#comment-16623945
 ] 

tison commented on FLINK-10388:
---

Maybe relevant to FLINK-4052

> RestClientTest sometimes fails with AssertionError
> --
>
> Key: FLINK-10388
> URL: https://issues.apache.org/jira/browse/FLINK-10388
> Project: Flink
>  Issue Type: Test
>Reporter: Ted Yu
>Priority: Minor
>
> Running the test on Linux I got:
> {code}
> testConnectionTimeout(org.apache.flink.runtime.rest.RestClientTest)  Time 
> elapsed: 1.918 sec  <<< FAILURE!
> java.lang.AssertionError:
> Expected: an instance of 
> org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException
>  but: 
>   Network is unreachable: /10.255.255.1:80> is a 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedSocketException
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>   at org.junit.Assert.assertThat(Assert.java:956)
>   at org.junit.Assert.assertThat(Assert.java:923)
>   at 
> org.apache.flink.runtime.rest.RestClientTest.testConnectionTimeout(RestClientTest.java:69)
> {code}





[jira] [Comment Edited] (FLINK-8803) Mini Cluster Shutdown with HA unstable, causing test failures

2018-09-21 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623459#comment-16623459
 ] 

tison edited comment on FLINK-8803 at 9/21/18 11:52 AM:


Is it "won't fix", since it's all about {{FlinkMiniCluster}}, which is based on 
legacy mode? The removal of legacy mode is said to be part of 1.7.0. Maybe we 
would fix this for 1.5.x and 1.6.x but not 1.7.0?


was (Author: tison):
Is it "won't fix", since it's all about {{FlinkMiniCluster}}, which is based on 
legacy mode?

> Mini Cluster Shutdown with HA unstable, causing test failures
> -
>
> Key: FLINK-8803
> URL: https://issues.apache.org/jira/browse/FLINK-8803
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Stephan Ewen
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> When the {{FlinkMiniCluster}} is created for HA tests with ZooKeeper, the 
> shutdown is unstable.
> It looks like ZooKeeper may be shut down before the JobManager is shut down, 
> causing the shutdown procedure of the JobManager (specifically 
> {{ZooKeeperSubmittedJobGraphStore.removeJobGraph}}) to block until tests time 
> out.
> Full log: https://api.travis-ci.org/v3/job/346853707/log.txt
> Note that no ZK threads are alive any more, seems ZK is shut down already.
> Relevant Stack Traces:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f973800a800 nid=0x43b4 waiting on 
> condition [0x7f973eb0b000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x8966cf18> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:212)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.ready(package.scala:169)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.startInternalShutdown(FlinkMiniCluster.scala:469)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.stop(FlinkMiniCluster.scala:435)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.closeAsync(FlinkMiniCluster.scala:719)
>   at 
> org.apache.flink.test.util.MiniClusterResource.after(MiniClusterResource.java:104)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:50)
> ...
> {code}
> {code}
> "flink-akka.actor.default-dispatcher-2" #1012 prio=5 os_prio=0 
> tid=0x7f97394fa800 nid=0x3328 waiting on condition [0x7f971db29000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x87f82a70> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:336)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:241)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:225)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.release(ZooKeeperStateHandleStore.java:478)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.releaseAndTryRemove(ZooKeeperStateHandleStore.java:435)
>   at 
> 

[jira] [Commented] (FLINK-8803) Mini Cluster Shutdown with HA unstable, causing test failures

2018-09-21 Thread tison (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623459#comment-16623459
 ] 

tison commented on FLINK-8803:
--

Is it "won't fix", since it's all about {{FlinkMiniCluster}}, which is based on 
legacy mode?

> Mini Cluster Shutdown with HA unstable, causing test failures
> -
>
> Key: FLINK-8803
> URL: https://issues.apache.org/jira/browse/FLINK-8803
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Stephan Ewen
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> When the {{FlinkMiniCluster}} is created for HA tests with ZooKeeper, the 
> shutdown is unstable.
> It looks like ZooKeeper may be shut down before the JobManager is shut down, 
> causing the shutdown procedure of the JobManager (specifically 
> {{ZooKeeperSubmittedJobGraphStore.removeJobGraph}}) to block until tests time 
> out.
> Full log: https://api.travis-ci.org/v3/job/346853707/log.txt
> Note that no ZK threads are alive any more, seems ZK is shut down already.
> Relevant Stack Traces:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7f973800a800 nid=0x43b4 waiting on 
> condition [0x7f973eb0b000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x8966cf18> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:212)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
>   at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.ready(package.scala:169)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.startInternalShutdown(FlinkMiniCluster.scala:469)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.stop(FlinkMiniCluster.scala:435)
>   at 
> org.apache.flink.runtime.minicluster.FlinkMiniCluster.closeAsync(FlinkMiniCluster.scala:719)
>   at 
> org.apache.flink.test.util.MiniClusterResource.after(MiniClusterResource.java:104)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:50)
> ...
> {code}
> {code}
> "flink-akka.actor.default-dispatcher-2" #1012 prio=5 os_prio=0 
> tid=0x7f97394fa800 nid=0x3328 waiting on condition [0x7f971db29000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x87f82a70> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:336)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:241)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:225)
>   at 
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.release(ZooKeeperStateHandleStore.java:478)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.releaseAndTryRemove(ZooKeeperStateHandleStore.java:435)
>   at 
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.releaseAndTryRemove(ZooKeeperStateHandleStore.java:405)
>   at 
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.removeJobGraph(ZooKeeperSubmittedJobGraphStore.java:266)
>   - locked <0x807f4258> (a