[jira] [Commented] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-10-16 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952651#comment-16952651
 ] 

Till Rohrmann commented on FLINK-14123:
---

The release notes you can find under 
https://github.com/apache/flink/blob/master/docs/release-notes/. The getting 
help can be found in the flink-web repository 
https://github.com/apache/flink-web/blob/asf-site/gettinghelp.md.

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing flink batch task, such as terasort, however, it 
> started only awhile then it failed due to OOM. 
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1007)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1080)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1080)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>   ... 23 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: GC 
> overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: GC overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:831)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> 

[jira] [Created] (FLINK-14407) YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n failed on Travis

2019-10-16 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-14407:
-

 Summary: YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n 
failed on Travis
 Key: FLINK-14407
 URL: https://issues.apache.org/jira/browse/FLINK-14407
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem, Deployment / YARN
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0


The {{YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n}} fails on Travis 
with 

{code}
01:41:19.215 [ERROR] 
testRecursiveUploadForYarnS3n(org.apache.flink.yarn.YarnFileStageTestS3ITCase)  
Time elapsed: 6.891 s  <<< FAILURE!
java.lang.AssertionError: 

Expected: <{1=Hello 1, 2=Hello 2, nested/4/5=Hello nested/4/5, test.jar=JAR 
Content, nested/3=Hello nested/3}>
 but: was <{1=Hello 1, 2=Hello 2, nested/4/5=Hello nested/4/5, test.jar=JAR 
Content}>
at 
org.apache.flink.yarn.YarnFileStageTestS3ITCase.testRecursiveUploadForYarn(YarnFileStageTestS3ITCase.java:159)
at 
org.apache.flink.yarn.YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n(YarnFileStageTestS3ITCase.java:177)
{code}

https://api.travis-ci.org/v3/job/598412139/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14407) YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n failed on Travis

2019-10-16 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952650#comment-16952650
 ] 

Till Rohrmann commented on FLINK-14407:
---

[~NicoK] could you help with this issue?

> YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n failed on Travis
> 
>
> Key: FLINK-14407
> URL: https://issues.apache.org/jira/browse/FLINK-14407
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Deployment / YARN
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> The {{YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n}} fails on 
> Travis with 
> {code}
> 01:41:19.215 [ERROR] 
> testRecursiveUploadForYarnS3n(org.apache.flink.yarn.YarnFileStageTestS3ITCase)
>   Time elapsed: 6.891 s  <<< FAILURE!
> java.lang.AssertionError: 
> Expected: <{1=Hello 1, 2=Hello 2, nested/4/5=Hello nested/4/5, test.jar=JAR 
> Content, nested/3=Hello nested/3}>
>  but: was <{1=Hello 1, 2=Hello 2, nested/4/5=Hello nested/4/5, 
> test.jar=JAR Content}>
>   at 
> org.apache.flink.yarn.YarnFileStageTestS3ITCase.testRecursiveUploadForYarn(YarnFileStageTestS3ITCase.java:159)
>   at 
> org.apache.flink.yarn.YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n(YarnFileStageTestS3ITCase.java:177)
> {code}
> https://api.travis-ci.org/v3/job/598412139/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14403) Remove uesless NotifyCheckpointComplete and TriggerCheckpoint

2019-10-16 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-14403:
-

Assignee: Yun Tang

> Remove uesless NotifyCheckpointComplete and TriggerCheckpoint
> -
>
> Key: FLINK-14403
> URL: https://issues.apache.org/jira/browse/FLINK-14403
> Project: Flink
>  Issue Type: Bug
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
>
> After [FLINK-12322|https://issues.apache.org/jira/browse/FLINK-12322] fixed, 
> we have removed  legacy {{ActorTaskManagerGateway}} and the usage of 
> {{NotifyCheckpointComplete}} and {{TriggerCheckpoint}} have been disabled. 
> However, these classes still exist currently, we should also remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14403) Remove uesless NotifyCheckpointComplete and TriggerCheckpoint

2019-10-16 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952640#comment-16952640
 ] 

Till Rohrmann commented on FLINK-14403:
---

Sounds like a good clean up. Go ahead and open a PR for it [~yunta].

> Remove uesless NotifyCheckpointComplete and TriggerCheckpoint
> -
>
> Key: FLINK-14403
> URL: https://issues.apache.org/jira/browse/FLINK-14403
> Project: Flink
>  Issue Type: Bug
>Reporter: Yun Tang
>Priority: Major
>
> After [FLINK-12322|https://issues.apache.org/jira/browse/FLINK-12322] fixed, 
> we have removed  legacy {{ActorTaskManagerGateway}} and the usage of 
> {{NotifyCheckpointComplete}} and {{TriggerCheckpoint}} have been disabled. 
> However, these classes still exist currently, we should also remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14393) add an option to enable/disable cancel job in web ui

2019-10-16 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952638#comment-16952638
 ] 

Till Rohrmann commented on FLINK-14393:
---

Sounds like a useful feature. Do you wanna work on this [~vthinkxie]?

> add an option to enable/disable cancel job in web ui
> 
>
> Key: FLINK-14393
> URL: https://issues.apache.org/jira/browse/FLINK-14393
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / REST
>Affects Versions: 1.10.0
>Reporter: Yadong Xie
>Priority: Major
> Fix For: 1.10.0
>
>
> add the option to enable/disable cancel job in web ui
> when disabled, user can not cancel a job through the web ui



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-3555) Web interface does not render job information properly

2019-10-16 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-3555.

Resolution: Invalid

No longer valid.

> Web interface does not render job information properly
> --
>
> Key: FLINK-3555
> URL: https://issues.apache.org/jira/browse/FLINK-3555
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Web Frontend
>Affects Versions: 1.0.0
>Reporter: Till Rohrmann
>Assignee: Sergey_Sokur
>Priority: Minor
>  Labels: pull-request-available
> Attachments: Chrome.png, Safari.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In Chrome and Safari, the different tabs of the detailed job view are not 
> properly rendered. The text goes beyond the surrounding box. I would guess 
> that this is some kind of css issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14382) Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and taskmanager on Yarn

2019-10-15 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14382:
--
Affects Version/s: 1.10.0

> Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and 
> taskmanager on Yarn
> -
>
> Key: FLINK-14382
> URL: https://issues.apache.org/jira/browse/FLINK-14382
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.10.0
>Reporter: Yang Wang
>Priority: Major
>
> If we do not set FLINK_PLUGINS_DIR in flink-conf.yaml, it will be set to 
> [flink 
> configuration|https://github.com/apache/flink/blob/9e6ff81e22d6f5f04abb50ca1aea84fd2542bf9d/flink-core/src/main/java/org/apache/flink/configuration/GlobalConfiguration.java#L158]
>  according to the environment.
> In yarn mode, the local path will be set in flink-conf.yaml and used by 
> jobmanager and taskmanager. We will find the warning log like below. 
> {code:java}
> 2019-10-12 19:24:58,165 WARN  org.apache.flink.core.plugin.PluginConfig   
>   - Environment variable [FLINK_PLUGINS_DIR] is set to 
> [/Users/wangy/IdeaProjects/apache-flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins]
>  but the directory doesn't exist
> {code}
> It was in introduced by FLINK-12143.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14382) Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and taskmanager on Yarn

2019-10-15 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14382:
--
Fix Version/s: 1.9.2
   1.10.0

> Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and 
> taskmanager on Yarn
> -
>
> Key: FLINK-14382
> URL: https://issues.apache.org/jira/browse/FLINK-14382
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Yang Wang
>Priority: Major
> Fix For: 1.10.0, 1.9.2
>
>
> If we do not set FLINK_PLUGINS_DIR in flink-conf.yaml, it will be set to 
> [flink 
> configuration|https://github.com/apache/flink/blob/9e6ff81e22d6f5f04abb50ca1aea84fd2542bf9d/flink-core/src/main/java/org/apache/flink/configuration/GlobalConfiguration.java#L158]
>  according to the environment.
> In yarn mode, the local path will be set in flink-conf.yaml and used by 
> jobmanager and taskmanager. We will find the warning log like below. 
> {code:java}
> 2019-10-12 19:24:58,165 WARN  org.apache.flink.core.plugin.PluginConfig   
>   - Environment variable [FLINK_PLUGINS_DIR] is set to 
> [/Users/wangy/IdeaProjects/apache-flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins]
>  but the directory doesn't exist
> {code}
> It was in introduced by FLINK-12143.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14382) Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and taskmanager on Yarn

2019-10-15 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951794#comment-16951794
 ] 

Till Rohrmann commented on FLINK-14382:
---

[~1u0] could you take a look at this problem?

> Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and 
> taskmanager on Yarn
> -
>
> Key: FLINK-14382
> URL: https://issues.apache.org/jira/browse/FLINK-14382
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Reporter: Yang Wang
>Priority: Major
>
> If we do not set FLINK_PLUGINS_DIR in flink-conf.yaml, it will be set to 
> [flink 
> configuration|https://github.com/apache/flink/blob/9e6ff81e22d6f5f04abb50ca1aea84fd2542bf9d/flink-core/src/main/java/org/apache/flink/configuration/GlobalConfiguration.java#L158]
>  according to the environment.
> In yarn mode, the local path will be set in flink-conf.yaml and used by 
> jobmanager and taskmanager. We will find the warning log like below. 
> {code:java}
> 2019-10-12 19:24:58,165 WARN  org.apache.flink.core.plugin.PluginConfig   
>   - Environment variable [FLINK_PLUGINS_DIR] is set to 
> [/Users/wangy/IdeaProjects/apache-flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins]
>  but the directory doesn't exist
> {code}
> It was in introduced by FLINK-12143.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14382) Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and taskmanager on Yarn

2019-10-15 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14382:
--
Affects Version/s: 1.9.0

> Wrong local FLINK_PLUGINS_DIR is set to flink-conf.yaml of jobmanager and 
> taskmanager on Yarn
> -
>
> Key: FLINK-14382
> URL: https://issues.apache.org/jira/browse/FLINK-14382
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Yang Wang
>Priority: Major
>
> If we do not set FLINK_PLUGINS_DIR in flink-conf.yaml, it will be set to 
> [flink 
> configuration|https://github.com/apache/flink/blob/9e6ff81e22d6f5f04abb50ca1aea84fd2542bf9d/flink-core/src/main/java/org/apache/flink/configuration/GlobalConfiguration.java#L158]
>  according to the environment.
> In yarn mode, the local path will be set in flink-conf.yaml and used by 
> jobmanager and taskmanager. We will find the warning log like below. 
> {code:java}
> 2019-10-12 19:24:58,165 WARN  org.apache.flink.core.plugin.PluginConfig   
>   - Environment variable [FLINK_PLUGINS_DIR] is set to 
> [/Users/wangy/IdeaProjects/apache-flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins]
>  but the directory doesn't exist
> {code}
> It was in introduced by FLINK-12143.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14309) Kafka09ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink fails on Travis

2019-10-15 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951792#comment-16951792
 ] 

Till Rohrmann commented on FLINK-14309:
---

We usually do backport test instability fixes to all release branches which are 
still officially supported and are affected by the test instability. In this 
case it would be {{release-1.9}} and {{release-1.8}}. I think if this fixes the 
problems then we should backport it because it will make the release of bug fix 
releases easier.

> Kafka09ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink
>  fails on Travis
> --
>
> Key: FLINK-14309
> URL: https://issues.apache.org/jira/browse/FLINK-14309
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Jiayi Liao
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The 
> {{Kafka09ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink}}
>  fails on Travis with 
> {code}
> Test 
> testOneToOneAtLeastOnceRegularSink(org.apache.flink.streaming.connectors.kafka.Kafka09ProducerITCase)
>  failed with:
> java.lang.AssertionError: Job should fail!
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnce(KafkaProducerTestBase.java:280)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink(KafkaProducerTestBase.java:206)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> https://api.travis-ci.com/v3/job/240747188/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-14215) Add Docs for TM and JM Environment Variable Setting

2019-10-15 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-14215.
---
Fix Version/s: 1.8.3
   1.10.0
   Resolution: Done

Done via

1.10.0: ca5e51822e5139702dae3e85f9237cdb41d4ff70
1.9.2: f758debb8c8bdf6d70fe1e0934e7cfd3a1ad2e0f
1.8.3: d5285d54ef0b44ea7d479b0d5289078fab1f5da8

> Add Docs for TM and JM Environment Variable Setting
> ---
>
> Key: FLINK-14215
> URL: https://issues.apache.org/jira/browse/FLINK-14215
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.8.3, 1.9.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add description for 
>   /**
>* Prefix for passing custom environment variables to Flink's master 
> process.
>* For example for passing LD_LIBRARY_PATH as an env variable to the 
> AppMaster, set:
>* containerized.master.env.LD_LIBRARY_PATH: "/usr/lib/native"
>* in the flink-conf.yaml.
>*/
>   public static final String CONTAINERIZED_MASTER_ENV_PREFIX = 
> "containerized.master.env.";
>   /**
>* Similar to the {@see CONTAINERIZED_MASTER_ENV_PREFIX}, this 
> configuration prefix allows
>* setting custom environment variables for the workers (TaskManagers).
>*/
>   public static final String CONTAINERIZED_TASK_MANAGER_ENV_PREFIX = 
> "containerized.taskmanager.env.";



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-13982) Implement memory calculation logics

2019-10-14 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-13982.
---
Fix Version/s: 1.10.0
   Resolution: Done

Done via

2f944ba3c71d30f289f60306222f566315fb6039
9b14f93eeb856c98c980fb40337ee74c488d3973
cf337043236c852eaa16e4e51d6c4e95d9a6d056
59eca61bff3965a71cdb16865050c3daa0c8014b
d571b2bcab8188ae12e47a63ea4cc5d4583fb7de

> Implement memory calculation logics
> ---
>
> Key: FLINK-13982
> URL: https://issues.apache.org/jira/browse/FLINK-13982
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Xintong Song
>Assignee: Xintong Song
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> * Introduce new configuration options
>  * Introduce data structures and utilities.
>  ** Data structure to store memory / pool sizes of task executor
>  ** Utility for calculating memory / pool sizes from configuration
>  ** Utility for generating dynamic configurations
>  ** Utility for generating JVM parameters
> This step should not introduce any behavior changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14316) stuck in "Job leader ... lost leadership" error

2019-10-14 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951016#comment-16951016
 ] 

Till Rohrmann commented on FLINK-14316:
---

[~stevenz3wu] sorry for being unresponsive the last couple of days. I'll still 
have this issue on my to do list and will look into it in the next couple of 
days. I hope to be able to give you an answer to your questions by then.

> stuck in "Job leader ... lost leadership" error
> ---
>
> Key: FLINK-14316
> URL: https://issues.apache.org/jira/browse/FLINK-14316
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.7.2
>Reporter: Steven Zhen Wu
>Priority: Major
> Attachments: FLINK-14316.tgz, RpcConnection.patch
>
>
> This is the first exception caused restart loop. Later exceptions are the 
> same. Job seems to stuck in this permanent failure state.
> {code}
> 2019-10-03 21:42:46,159 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: 
> clpevents -> device_filter -> processed_imps -> ios_processed_impression -> i
> mps_ts_assigner (449/1360) (d237f5e99b6a4a580498821473763edb) switched from 
> SCHEDULED to FAILED.
> java.lang.Exception: Job leader for job id ecb9ad9be934edf7b1a4f7b9dd6df365 
> lost leadership.
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1526)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14318) JDK11 build stalls during shading

2019-10-12 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949972#comment-16949972
 ] 

Till Rohrmann commented on FLINK-14318:
---

Another instance: https://api.travis-ci.org/v3/job/596712083/log.txt

> JDK11 build stalls during shading
> -
>
> Key: FLINK-14318
> URL: https://issues.apache.org/jira/browse/FLINK-14318
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.10.0
>Reporter: Gary Yao
>Priority: Critical
>  Labels: test-stability
>
> JDK11 build stalls during shading.
> Travis stage: e2d - misc - jdk11
> https://travis-ci.org/apache/flink/builds/593022581?utm_source=slack_medium=notification
> https://api.travis-ci.org/v3/job/593022629/log.txt
> Relevant excerpt from logs:
> {noformat}
> 01:53:43.889 [INFO] 
> 
> 01:53:43.889 [INFO] Building flink-metrics-reporter-prometheus-test 
> 1.10-SNAPSHOT
> 01:53:43.889 [INFO] 
> 
> ...
> 01:53:44.508 [INFO] Including 
> org.apache.flink:force-shading:jar:1.10-SNAPSHOT in the shaded jar.
> 01:53:44.508 [INFO] Excluding org.slf4j:slf4j-api:jar:1.7.15 from the shaded 
> jar.
> 01:53:44.508 [INFO] Excluding com.google.code.findbugs:jsr305:jar:1.3.9 from 
> the shaded jar.
> 01:53:44.508 [INFO] No artifact matching filter io.netty:netty
> 01:53:44.522 [INFO] Replacing original artifact with shaded artifact.
> 01:53:44.523 [INFO] Replacing 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT.jar
>  with 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-shaded.jar
> 01:53:44.524 [INFO] Replacing original test artifact with shaded test 
> artifact.
> 01:53:44.524 [INFO] Replacing 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-tests.jar
>  with 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-shaded-tests.jar
> 01:53:44.524 [INFO] Dependency-reduced POM written at: 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/dependency-reduced-pom.xml
> No output has been received in the last 10m0s, this potentially indicates a 
> stalled build or something wrong with the build itself.
> Check the details on how to adjust your build configuration on: 
> https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
> The build has been terminated
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-12122:
-

Assignee: Till Rohrmann

> Spread out tasks evenly across all available registered TaskManagers
> 
>
> Key: FLINK-12122
> URL: https://issues.apache.org/jira/browse/FLINK-12122
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.6.4, 1.7.2, 1.8.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Attachments: image-2019-05-21-12-28-29-538.png, 
> image-2019-05-21-13-02-50-251.png
>
>
> With Flip-6, we changed the default behaviour how slots are assigned to 
> {{TaskManages}}. Instead of evenly spreading it out over all registered 
> {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a 
> tendency to first fill up a TM before using another one. This is a regression 
> wrt the pre Flip-6 code.
> I suggest to change the behaviour so that we try to evenly distribute slots 
> across all available {{TaskManagers}} by considering how many of their slots 
> are already allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-14343) Remove uncompleted YARNHighAvailabilityService

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-14343.
---
Resolution: Done

Removed via 80b27a150026b7b5cb707bd9fa3e17f565bb8112

> Remove uncompleted YARNHighAvailabilityService
> --
>
> Key: FLINK-14343
> URL: https://issues.apache.org/jira/browse/FLINK-14343
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Coordination
>Reporter: Zili Chen
>Assignee: Zili Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Corresponding mailing list 
> [thread|https://lists.apache.org/x/thread.html/6022f2124be91e3f4667d61a977ea0639e2c19286560d6d1cb874792@%3Cdev.flink.apache.org%3E].
> Noticed that there are several stale & uncompleted high-availability services 
> implementations, I start this thread in order to see whether or not we can 
> remove them for a
> clean codebase.
> Below are all of classes I noticed.
> - YarnHighAvailabilityServices
> - AbstractYarnNonHaServices
> - YarnIntraNonHaMasterServices
> - YarnPreConfiguredMasterNonHaServices
> - SingleLeaderElectionService
> - FsNegativeRunningJobsRegistry
> (as well as their dedicated tests)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-14347.
---
Fix Version/s: (was: 1.9.1)
   1.9.2
   Resolution: Fixed

Fixed via

1.10.0: fbfa8beb6847ed14641399afbbfd9a378d91e6f5
1.9.2: 01c97d56deffdfdd27f2f3e7fcdf734fd166257d
1.8.3: 903ac2135a0932dacce40f158a9c1a7afd50f564

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Assignee: Zili Chen
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.8.3, 1.9.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-7738) Create WebSocket handler (server)

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-7738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-7738.

Resolution: Abandoned

> Create WebSocket handler (server)
> -
>
> Key: FLINK-7738
> URL: https://issues.apache.org/jira/browse/FLINK-7738
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Mesos, Runtime / Coordination
>Reporter: Eron Wright
>Priority: Major
>  Labels: pull-request-available
>
> An abstract handler is needed to support websocket communication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-7022) Flink Job Manager Scheduler & Web Frontend out of sync when Zookeeper is unavailable on startup

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-7022.

Resolution: Not A Problem

No longer a problem.

> Flink Job Manager Scheduler & Web Frontend out of sync when Zookeeper is 
> unavailable on startup
> ---
>
> Key: FLINK-7022
> URL: https://issues.apache.org/jira/browse/FLINK-7022
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Environment: Kubernetes cluster running:
> * Flink 1.3.0 Job Manager & Task Manager on Java 8u131
> * Zookeeper 3.4.10 cluster with 3 nodes
>Reporter: Scott Kidder
>Priority: Major
>
> h2. Problem
> Flink Job Manager web frontend is permanently unavailable if one or more 
> Zookeeper nodes are unresolvable during startup. The job scheduler eventually 
> recovers and assigns jobs to task managers, but the web frontend continues to 
> respond with an HTTP 503 and the following message:
> {noformat}Service temporarily unavailable due to an ongoing leader election. 
> Please refresh.{noformat}
> h2. Expected Behavior
> Once Flink is able to interact with Zookeeper successfully, all aspects of 
> the Job Manager (job scheduling & the web frontend) should be available.
> h2. Environment Details
> We're running Flink and Zookeeper in Kubernetes on CoreOS. CoreOS can run in 
> a configuration that automatically detects and applies operating system 
> updates. We have a Zookeeper node running on the same CoreOS instance as 
> Flink. It's possible that the Zookeeper node will not yet be started when the 
> Flink components are started. This could cause hostname resolution of the 
> Zookeeper nodes to fail.
> h3. Flink Task Manager Logs
> {noformat}
> 2017-06-27 15:38:47,161 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: metrics.reporter.statsd.host, localhost
> 2017-06-27 15:38:47,161 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: metrics.reporter.statsd.port, 8125
> 2017-06-27 15:38:47,162 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: metrics.reporter.statsd.interval, 10 SECONDS
> 2017-06-27 15:38:47,254 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: state.backend, filesystem
> 2017-06-27 15:38:47,254 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: state.backend.fs.checkpointdir, 
> hdfs://hdfs:8020/flink/checkpoints
> 2017-06-27 15:38:47,255 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: state.savepoints.dir, 
> hdfs://hdfs:8020/flink/savepoints
> 2017-06-27 15:38:47,255 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: recovery.mode, zookeeper
> 2017-06-27 15:38:47,256 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: recovery.zookeeper.quorum, 
> zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
> 2017-06-27 15:38:47,256 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: recovery.zookeeper.storageDir, 
> hdfs://hdfs:8020/flink/recovery
> 2017-06-27 15:38:47,256 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: recovery.jobmanager.port, 6123
> 2017-06-27 15:38:47,257 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: blob.server.port, 41479
> 2017-06-27 15:38:47,357 WARN  org.apache.flink.configuration.Configuration
>   - Config uses deprecated configuration key 'recovery.mode' 
> instead of proper key 'high-availability'
> 2017-06-27 15:38:47,366 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>   - Starting JobManager with high-availability
> 2017-06-27 15:38:47,366 WARN  org.apache.flink.configuration.Configuration
>   - Config uses deprecated configuration key 
> 'recovery.jobmanager.port' instead of proper key 
> 'high-availability.jobmanager.port'
> 2017-06-27 15:38:47,452 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>   - Starting JobManager on flink:6123 with execution mode CLUSTER
> 2017-06-27 15:38:47,549 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: jobmanager.rpc.address, flink
> 2017-06-27 15:38:47,549 INFO  
> org.apache.flink.configuration.GlobalConfiguration- 

[jira] [Commented] (FLINK-7009) dogstatsd mode in statsd reporter

2019-10-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949331#comment-16949331
 ] 

Till Rohrmann commented on FLINK-7009:
--

[~chesnay], do we want to have the PR in Flink? If not, then we could close it 
and mark this issue as abandoned.

> dogstatsd mode in statsd reporter
> -
>
> Key: FLINK-7009
> URL: https://issues.apache.org/jira/browse/FLINK-7009
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.4.0
> Environment: org.apache.flink.metrics.statsd.StatsDReporter
>Reporter: David Brinegar
>Priority: Major
>
> The current statsd reporter can only report a subset of Flink metrics owing 
> to the manner in which Flink variables are handled, mainly around invalid 
> characters and metrics too long.  As an option, it would be quite useful to 
> have a stricter dogstatsd compliant output.  Dogstatsd metrics are tagged, 
> should be less than 200 characters including tag names and values, be 
> alphanumeric + underbar, delimited by periods.  As a further pragmatic 
> restriction, negative and other invalid values should be ignored rather than 
> sent to the backend.  These restrictions play well with a broad set of 
> collectors and time series databases.
> This mode would:
> * convert output to ascii alphanumeric characters with underbar, delimited by 
> periods.  Runs of invalid characters within a metric segment would be 
> collapsed to a single underbar.
> * report all Flink variables as tags
> * compress overly long segments, say over 50 chars, to a symbolic 
> representation of the metric name, to preserve the unique metric time series 
> but avoid downstream truncation
> * compress 32 character Flink IDs like tm_id, task_id, job_id, 
> task_attempt_id, to the first 8 characters, again to preserve enough 
> distinction amongst metrics while trimming up to 96 characters from the metric
> * remove object references from names, such as the instance hash id of the 
> serializer
> * drop negative or invalid numeric values such as "n/a", "-1" which is used 
> for unknowns like JVM.Memory.NonHeap.Max, and "-9223372036854775808" which is 
> used for unknowns like currentLowWaterMark
> With these in place, it becomes quite reasonable to support LatencyGauge 
> metrics as well.
> One idea for symbolic compression is to take the first 10 valid characters 
> plus a hash of the long name.  For example, a value like this operator_name:
> {code:java}
> TriggerWindow(TumblingProcessingTimeWindows(5000), 
> ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.PojoSerializer@f3395ffa,
>  
> reduceFunction=org.apache.flink.streaming.examples.socket.SocketWindowWordCount$1@4201c465},
>  ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java-301))
> {code}
> would first drop the instance references.  The stable version would be:
>  
> {code:java}
> TriggerWindow(TumblingProcessingTimeWindows(5000), 
> ReducingStateDescriptor{serializer=org.apache.flink.api.java.typeutils.runtime.PojoSerializer,
>  
> reduceFunction=org.apache.flink.streaming.examples.socket.SocketWindowWordCount$1},
>  ProcessingTimeTrigger(), WindowedStream.reduce(WindowedStream.java-301))
> {code}
> and then the compressed name would be the first ten valid characters plus the 
> hash of the stable string:
> {code}
> TriggerWin_d8c007da
> {code}
> This is just one way of dealing with unruly default names, the main point 
> would be to preserve the metrics so they are valid, avoid truncation, and can 
> be aggregated along other dimensions even if this particular dimension is 
> hard to parse after the compression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6972) Flink REPL api

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6972.

Resolution: Won't Do

> Flink REPL api
> --
>
> Key: FLINK-6972
> URL: https://issues.apache.org/jira/browse/FLINK-6972
> Project: Flink
>  Issue Type: Improvement
>  Components: Command Line Client
>Reporter: Praveen Kanamarlapudi
>Priority: Major
>
> Can you please develop FlinkIMap (Similar to 
> [SparkIMain|https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala])
>  developer api for creating interactive sessions.
> I am thinking to add flink support to 
> [livy|https://github.com/cloudera/livy/]. For enabling flink interactive 
> sessions it would be really helpful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6910) Metrics value for lastCheckpointExternalPath is not valid

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6910.

Resolution: Abandoned

> Metrics value for lastCheckpointExternalPath is not valid
> -
>
> Key: FLINK-6910
> URL: https://issues.apache.org/jira/browse/FLINK-6910
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.3.0
> Environment: Testing against a Telegraf StatsD server.
>Reporter: Chris Dail
>Priority: Minor
>
> The value for the lastCheckpointExternalPath metric seems to be a string (I'm 
> seeing a value of 'n/a'). The metric type is a gauge which is required to be 
> a float.
> Telegraf gives an error on this:
> {noformat}
> 2017-06-12T15:02:21Z E! Error: parsing value to float64: 
> bce343c5-393a-4f70-bd0c-f7578443c9a4-0005.flink__9ef0b9a8-533a-4b5e-8667-1967e9cae849.a473fcfe-0907-4084-8eff-f5f0c3ea1844.flink.jm.TurbineHeatProcessor_examples_turbineHeatTest.lastCheckpointExternalPath:n/a|g
> {noformat}
> This is pretty minor since telegraf will ignore this an continue on. Other 
> StatsD backends may behave differently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-6900) Limit size of indiivual components in DropwizardReporter

2019-10-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949329#comment-16949329
 ] 

Till Rohrmann commented on FLINK-6900:
--

[~chesnay] is this still a valid problem? Is the PR still up to date?

> Limit size of indiivual components in DropwizardReporter
> 
>
> Key: FLINK-6900
> URL: https://issues.apache.org/jira/browse/FLINK-6900
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6741) In yarn cluster model with high available, the HDFS file is not deleted when cluster is shot down

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6741.

Resolution: Abandoned

> In yarn cluster model with high available, the HDFS file is not deleted when 
> cluster is shot down
> -
>
> Key: FLINK-6741
> URL: https://issues.apache.org/jira/browse/FLINK-6741
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Reporter: zhangrucong1982
>Assignee: zhangrucong1982
>Priority: Major
>
> The flink version of 1.3.0 rc2. I use yarn cluster with high available.
> 1、the mainly configuration is:
> high-availability.zookeeper.storageDir: hdfs:///flink/recovery.
> 2、I use the command "./yarn-session.sh -n 2 -d" to start a cluster;
> 3、I use the command "./flink run ../example/streaming/WindowJoin.rar" to 
> summit a job;
> 4、I use the flink cancel command to cancel job;
> 5、 I use the “./yarn-session.sh -id ” command to attach to the yarn 
> cluster and use stop command to shutdown the cluster.
> 6、After shutdown the cluster,I find the file in hdfs is not deleted. Like the 
> following:
> /flink/recovery/application_1495781150990_0006/blob/cache/blob_9b2a6f6535819075889ebcf64490f4e6528b07



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6673) Yarn single-job submission doesn't fail early when to few slows are available

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6673.

Resolution: Not A Problem

not a problem anymore

> Yarn single-job submission doesn't fail early when to few slows are available
> -
>
> Key: FLINK-6673
> URL: https://issues.apache.org/jira/browse/FLINK-6673
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Coordination
>Affects Versions: 1.3.0
>Reporter: Chesnay Schepler
>Priority: Major
>
> When submitting a single job to yarn that requires more slots than are 
> available (in total) the job ends up in a restart-loop, instead of being 
> canceled right away.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6668) Add flink history server to DCOS

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6668.

Resolution: Abandoned

> Add flink history server to DCOS
> 
>
> Key: FLINK-6668
> URL: https://issues.apache.org/jira/browse/FLINK-6668
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Mesos
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> We need to have history server within dc/os env as with the spark case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6664) Extend Restart Strategy for Fine-grained Recovery and missing Resources

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6664.

Resolution: Abandoned

Superseded by the scheduler refactoring: FLINK-10429

> Extend Restart Strategy for Fine-grained Recovery and missing Resources
> ---
>
> Key: FLINK-6664
> URL: https://issues.apache.org/jira/browse/FLINK-6664
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Priority: Major
>
> This is the umbrella issue to extend the {{RestartStrategy}} to handle 
> different kinds of failures / recoveries better:
>   - Differentiation between local- and global failover
>   - Better abstraction over the execution graph
>   - Delaying restarts without blocking threads



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6666) RestartStrategy should differentiate between types of recovery (global / local / resource missing)

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-.

Resolution: Abandoned

Superseded by the scheduler refactoring.

> RestartStrategy should differentiate between types of recovery (global / 
> local / resource missing)
> --
>
> Key: FLINK-
> URL: https://issues.apache.org/jira/browse/FLINK-
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Priority: Major
>
> Currently, the {{RestrartStrategy}} has a single method that is called when a 
> failure requires an ExecutionGraph restart.
> With the new addition of incremental recovery, it is desirable to distinguish 
> between the type of failover that happens.
> I would suggest to extend the {{RestartStrategy}} to support three 
> cases/methods:
>   - {{restartGlobal()}} for a full restart recovery
>   - {{restartLocal()}} for a recovery coordinated by the {{FailoverStrategy}}
>   - {{restartOnMissingResources()}} if the failure cause was missing slots
> The last case is interesting, in my opinion, because it is commonly desirable 
> that regular failover has no delay, but failover on missing resources has a 
> short delay (1s or so) to avoid very fast cycles of restart attempts (in 
> standalone mode, there can easily be 100,000 restarts after a second, when no 
> resources are available and no delay happens during restarts).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6627) Expose tmp directories via API

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6627.

Resolution: Abandoned

> Expose tmp directories via API
> --
>
> Key: FLINK-6627
> URL: https://issues.apache.org/jira/browse/FLINK-6627
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.3.2, 1.4.0
>Reporter: Andrey
>Priority: Major
>
> Currently tmp/blob directories created based on fixed baseDir and random 
> postfix. For example blob directory:
> {code}
> new File(baseDir, String.format("blobStore-%s", UUID.randomUUID().toString()))
> {code}
> This directory name is not exposed externally. This will cause several issues 
> in the following scenario:
> 1) Start 1 task manager
> 2) random blob directory created. For example: "blob-1"
> 3) Start 2 task manager
> 4) random blob directory created. For example: "blob-2"
> 5) 1 task manager dies unexpectedly. (kill -9 or OOM).
> 6) directory "blob-1" will not be deleted.
> 7) 1 task manager automatically restarted
> 8) random blob directory created. For example: "blob-3"
> The issues:
> * The directory "blob-1" will never be deleted. 
> * The external cleanup script cannot get information about current 
> directories being in use. Because information is not exposed externally. So 
> it cannot delete unused directories.
> * Sorting directories by "created time" and keeping last X, won't help, 
> because 1 faulty task manager could generate X+1 new directories.
> * giving different "blob.storage.directory" for different task managers is 
> not a scalable solution for cloud/docker deployment, because there should be 
> central storage for current number of running task managers.
> Proposed solution:
> * expose via rest API current working directory for blob/tmp. In that case: 
> ** cleanup script could get all blob/tmp directories being in use from the 
> cluster
> ** get all blob/tmp directories ("ls")
> ** find blob/tmp directories not being used. 
> ** delete them



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-6626) Unifying lifecycle management of SubmittedJobGraph- and CompletedCheckpointStore

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-6626:


Assignee: (was: Tarush Grover)

> Unifying lifecycle management of SubmittedJobGraph- and 
> CompletedCheckpointStore
> 
>
> Key: FLINK-6626
> URL: https://issues.apache.org/jira/browse/FLINK-6626
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Priority: Major
>
> Currently, Flink uses the {{SubmittedJobGraphStore}} to persist {{JobGraphs}} 
> such that they can be recovered in case of failures. The 
> {{SubmittedJobGraphStore}} is managed by by the {{JobManager}}. Additionally, 
> Flink has the {{CompletedCheckpointStore}} which stores checkpoints for a 
> given {{ExecutionGraph}}/job. The {{CompletedCheckpointStore}} is managed by 
> the {{CheckpointCoordinator}}.
> The {{SubmittedJobGraphStore}} and the {{CompletedCheckpointStore}} are 
> somewhat related because in the latter we store checkpoints for jobs 
> contained in the former. I think it would be nice wrt lifecycle management to 
> let the {{SubmittedJobGraphStore}} manage the lifecycle of the 
> {{CompletedCheckpointStore}}, because often it does not make much sense to 
> keep only checkpoints without a job or a job without checkpoints. 
> An idea would be when we register a job with the {{SubmittedJobGraphStore}} 
> then it returns a {{CompletedCheckpointStore}}. This store can then be given 
> to the {{CheckpointCoordinator}} to store the checkpoints. When a job enters 
> a terminal state it could be the responsibility of the 
> {{SubmittedJobGraphStore}} to decide what to do with the job data 
> ({{JobGraph}} and {{Checkpoints}}), e.g. keeping it or cleaning it up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6625) Flink removes HA job data when reaching JobStatus.FAILED

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6625.

Resolution: Won't Do

> Flink removes HA job data when reaching JobStatus.FAILED
> 
>
> Key: FLINK-6625
> URL: https://issues.apache.org/jira/browse/FLINK-6625
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Priority: Major
>
> Currently, Flink removes all job related data (submitted {{JobGraph}} as well 
> as checkpoints) when it reaches a globally terminal state (including 
> {{JobStatus.FAILED}}). In high availability mode, this entails that all data 
> is removed from ZooKeeper and there is no way to recover the job by 
> restarting the cluster with the same cluster id.
> I think this is problematic, since an application might just have failed 
> because it has depleted its numbers of restart attempts. Also the last 
> checkpoint information could be helpful when trying to find out why the job 
> has actually failed. I propose that we only remove job data when reaching the 
> state {{JobStatus.SUCCESS}} or {{JobStatus.CANCELED}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6594) Implement Flink Dispatcher for Kubernetes

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6594.

Resolution: Abandoned

> Implement Flink Dispatcher for Kubernetes
> -
>
> Key: FLINK-6594
> URL: https://issues.apache.org/jira/browse/FLINK-6594
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Coordination
>Reporter: Larry Wu
>Assignee: Larry Wu
>Priority: Major
>  Labels: Kubernetes
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This task is to implement Flink Dispatcher for Kubernetes, which is deployed 
> to Kubernetes cluster as a long-running pod. The Flink Dispatcher accepts job 
> submissions from Flink clients and asks Kubernetes API Server to create and 
> monitor a virtual cluster of Flink JobManager pod and Flink TaskManager Pods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6526) BlobStore files might become orphans in case of recovery

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6526.

Resolution: Abandoned

> BlobStore files might become orphans in case of recovery
> 
>
> Key: FLINK-6526
> URL: https://issues.apache.org/jira/browse/FLINK-6526
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Priority: Major
>
> The {{BlobStore}} is used to store {{BlobServer}} files persistently if HA is 
> enabled. The {{BlobLibraryCacheManager}} is responsible for keeping track of 
> a reference count for each file. Once the count is {{0}} the 
> {{BlobLibraryCacheManager}} will eventually delete this file from the 
> {{BlobServer}} and also the {{BlobStore}}. In case of recovery, the 
> {{BlobLibraryCacheManager}} will only recover those files which are actively 
> asked for (e.g. jar files of new job submission or job recovery). All other 
> files which might have had a reference count of {{0}} and were supposed to be 
> eventually deleted, won't be reregistered on the {{BlobLibraryCacheManager}}. 
> Consequently, these files will never be deleted and remain on the BlobStore 
> for all eternity.
> I think upon recovery, all files currently being held in the {{BlobStore}} 
> should be re-registered with the {{BlobLibraryCacheManager}} such that they 
> will be eventually deleted once they timed out with a reference count of 
> {{0}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6525) Transferred TM log/stdout files are never removed from BlobStore

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6525.

Resolution: Invalid

No longer valid.

> Transferred TM log/stdout files are never removed from BlobStore
> 
>
> Key: FLINK-6525
> URL: https://issues.apache.org/jira/browse/FLINK-6525
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Priority: Major
>
> The {{TaskManager}} uses the {{BlobClient}} to upload its stdout/log file to 
> the {{BlobServer}}. If HA mode is enabled, then these files will also be 
> uploaded to the {{BlobStore}}. Since the {{TaskManagerLogHandler}} only 
> cleans up files from a TM in case it has already received another file from 
> this TM and additionally does this in a non thread safe manner, it can easily 
> happen that files won't get cleaned up from the {{BlobStore}}.
> I think we should not upload these kind of files to the persistent/HA 
> {{BlobStore}}. We could do this by introducing a storage mode when uploading 
> files to the {{BlobServer}} (e.g. {{HA_STORAGE}} vs. {{LOCAL_STORAGE}}). 
> Additionally, we should also register a timeout for only locally stored files 
> or at least store them under its {{JobID}} such that these files are also 
> cleaned up once the job is being cleaned up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6524) Make server address of BlobCache configurable

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6524.

Resolution: Duplicate

Done as part of FLINK-8501.

> Make server address of BlobCache configurable
> -
>
> Key: FLINK-6524
> URL: https://issues.apache.org/jira/browse/FLINK-6524
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Priority: Major
>
> Currently, the {{BlobCache}} can only be started with a fixed address of the 
> {{BlobServer}} such that it know where to ask for blobs which are not already 
> cached. In order to connect to a new {{BlobServer}} a new {{BlobCache}} has 
> to be created, effectively deleting all previously cached files. This thwarts 
> the essential purpose of a cache and, thus, should be changed.
> I propose to make the server address configurable such that a single 
> {{BlobCache}} can be connected to a new {{BlobServer}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-6523) Improve lifecycle management of created BlobCaches in TaskManagerLogHandler and JobClient

2019-10-11 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-6523.

Resolution: Won't Do

No longer valid.

> Improve lifecycle management of created BlobCaches in TaskManagerLogHandler 
> and JobClient
> -
>
> Key: FLINK-6523
> URL: https://issues.apache.org/jira/browse/FLINK-6523
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Priority: Major
>
> Some components like the {{JobClient#retrieveClassLoader}} and 
> {{TaskManagerLogHandler}} create {{BlobCaches}} to retrieve data from the 
> {{BlobServer}}. These instances are created in a local context and their 
> lifecycle is not properly managed. The cleanup/closing solely relies on the 
> fact that their storage directory will be cleaned up eventually by a 
> registered shutdown hook.
> I think it would be better to properly implement a lifecycle management for 
> these components and close them at an appropriate location in the code. E.g. 
> one could pass in a shared {{BlobCache}} which is closed once the JobClient 
> action is over.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13567) Avro Confluent Schema Registry nightly end-to-end test failed on Travis

2019-10-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949313#comment-16949313
 ] 

Till Rohrmann commented on FLINK-13567:
---

I've disabled the test via 2c2095bdad3d47f27973a585112ed820f457de6f until the 
problem has been fixed.

> Avro Confluent Schema Registry nightly end-to-end test failed on Travis
> ---
>
> Key: FLINK-13567
> URL: https://issues.apache.org/jira/browse/FLINK-13567
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka, Tests
>Affects Versions: 1.8.2, 1.9.0, 1.10.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> The {{Avro Confluent Schema Registry nightly end-to-end test}} failed on 
> Travis with
> {code}
> [FAIL] 'Avro Confluent Schema Registry nightly end-to-end test' failed after 
> 2 minutes and 11 seconds! Test exited with exit code 1
> No taskexecutor daemon (pid: 29044) is running anymore on 
> travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
> No standalonesession daemon to stop on host 
> travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
> rm: cannot remove 
> '/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins':
>  No such file or directory
> {code}
> https://api.travis-ci.org/v3/job/567273939/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-13567) Avro Confluent Schema Registry nightly end-to-end test failed on Travis

2019-10-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949313#comment-16949313
 ] 

Till Rohrmann edited comment on FLINK-13567 at 10/11/19 9:30 AM:
-

I've disabled the test in {{master}} via 
2c2095bdad3d47f27973a585112ed820f457de6f until the problem has been fixed.


was (Author: till.rohrmann):
I've disabled the test via 2c2095bdad3d47f27973a585112ed820f457de6f until the 
problem has been fixed.

> Avro Confluent Schema Registry nightly end-to-end test failed on Travis
> ---
>
> Key: FLINK-13567
> URL: https://issues.apache.org/jira/browse/FLINK-13567
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka, Tests
>Affects Versions: 1.8.2, 1.9.0, 1.10.0
>Reporter: Till Rohrmann
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> The {{Avro Confluent Schema Registry nightly end-to-end test}} failed on 
> Travis with
> {code}
> [FAIL] 'Avro Confluent Schema Registry nightly end-to-end test' failed after 
> 2 minutes and 11 seconds! Test exited with exit code 1
> No taskexecutor daemon (pid: 29044) is running anymore on 
> travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
> No standalonesession daemon to stop on host 
> travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
> rm: cannot remove 
> '/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins':
>  No such file or directory
> {code}
> https://api.travis-ci.org/v3/job/567273939/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14309) Kafka09ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink fails on Travis

2019-10-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949307#comment-16949307
 ] 

Till Rohrmann commented on FLINK-14309:
---

Does it make sense to backport this fix to earlier release branches 
[~becket_qin]?

> Kafka09ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink
>  fails on Travis
> --
>
> Key: FLINK-14309
> URL: https://issues.apache.org/jira/browse/FLINK-14309
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Jiayi Liao
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The 
> {{Kafka09ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink}}
>  fails on Travis with 
> {code}
> Test 
> testOneToOneAtLeastOnceRegularSink(org.apache.flink.streaming.connectors.kafka.Kafka09ProducerITCase)
>  failed with:
> java.lang.AssertionError: Job should fail!
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnce(KafkaProducerTestBase.java:280)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink(KafkaProducerTestBase.java:206)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> https://api.travis-ci.com/v3/job/240747188/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14370) KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink fails on Travis

2019-10-11 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-14370:
-

 Summary: 
KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink
 fails on Travis
 Key: FLINK-14370
 URL: https://issues.apache.org/jira/browse/FLINK-14370
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.10.0
Reporter: Till Rohrmann


The 
{{KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink}}
 fails on Travis with

{code}
Test 
testOneToOneAtLeastOnceRegularSink(org.apache.flink.streaming.connectors.kafka.KafkaProducerAtLeastOnceITCase)
 failed with:
java.lang.AssertionError: Job should fail!
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnce(KafkaProducerTestBase.java:280)
at 
org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnceRegularSink(KafkaProducerTestBase.java:206)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{code}

https://api.travis-ci.com/v3/job/244297223/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14369) KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator fails on Travis

2019-10-11 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-14369:
-

 Summary: 
KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator
 fails on Travis
 Key: FLINK-14369
 URL: https://issues.apache.org/jira/browse/FLINK-14369
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.10.0
Reporter: Till Rohrmann


The 
{{KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator}}
 fails on Travis with 

{code}
Test 
testOneToOneAtLeastOnceCustomOperator(org.apache.flink.streaming.connectors.kafka.KafkaProducerAtLeastOnceITCase)
 failed with:
java.lang.AssertionError: Create test topic : oneToOneTopicCustomOperator 
failed, org.apache.kafka.common.errors.TimeoutException: Timed out waiting for 
a node assignment.
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestEnvironmentImpl.createTestTopic(KafkaTestEnvironmentImpl.java:180)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestEnvironment.createTestTopic(KafkaTestEnvironment.java:115)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestBase.createTestTopic(KafkaTestBase.java:196)
at 
org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnce(KafkaProducerTestBase.java:231)
at 
org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator(KafkaProducerTestBase.java:214)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{code}

https://api.travis-ci.com/v3/job/244297223/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14368) KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testCustomPartitioning failed on Travis

2019-10-11 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-14368:
-

 Summary: 
KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testCustomPartitioning 
failed on Travis
 Key: FLINK-14368
 URL: https://issues.apache.org/jira/browse/FLINK-14368
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.10.0
Reporter: Till Rohrmann


The 
{{KafkaProducerAtLeastOnceITCase>KafkaProducerTestBase.testCustomPartitioning}} 
fails on Travis with

{code}
Test 
testCustomPartitioning(org.apache.flink.streaming.connectors.kafka.KafkaProducerAtLeastOnceITCase)
 failed with:
java.lang.AssertionError: Create test topic : defaultTopic failed, 
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node 
assignment.
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestEnvironmentImpl.createTestTopic(KafkaTestEnvironmentImpl.java:180)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestEnvironment.createTestTopic(KafkaTestEnvironment.java:115)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestBase.createTestTopic(KafkaTestBase.java:196)
at 
org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testCustomPartitioning(KafkaProducerTestBase.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{code}

https://api.travis-ci.com/v3/job/244297223/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14224) Kafka010ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator fails on Travis

2019-10-11 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949276#comment-16949276
 ] 

Till Rohrmann commented on FLINK-14224:
---

[~1u0] have you tried looping the test on Travis?

> Kafka010ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator
>  fails on Travis
> --
>
> Key: FLINK-14224
> URL: https://issues.apache.org/jira/browse/FLINK-14224
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Alex
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> The 
> {{Kafka010ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator}}
>  fails on Travis with
> {code}
> Test 
> testOneToOneAtLeastOnceCustomOperator(org.apache.flink.streaming.connectors.kafka.Kafka010ProducerITCase)
>  failed with:
> java.lang.AssertionError: Job should fail!
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnce(KafkaProducerTestBase.java:280)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator(KafkaProducerTestBase.java:214)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> https://api.travis-ci.com/v3/job/238920411/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14298) Change LeaderContender#getAddress into LeaderContender#getDescription

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14298.
-
Resolution: Done

Done via 4813a2396cc7a1a054b2c46c673ea96ced982523

> Change LeaderContender#getAddress into LeaderContender#getDescription
> -
>
> Key: FLINK-14298
> URL: https://issues.apache.org/jira/browse/FLINK-14298
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After the changes introduced with FLINK-14287, we no longer need the 
> {{LeaderContender}} to have a method {{getAddress}}. Instead we could change 
> it to {{getDescription}} which is solely used for logging purposes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14307) Extract JobGraphWriter from JobGraphStore

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14307.
-
Resolution: Done

Done via cda6dc0c44239aa7a36105988328de5744aea125

> Extract JobGraphWriter from JobGraphStore
> -
>
> Key: FLINK-14307
> URL: https://issues.apache.org/jira/browse/FLINK-14307
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In order to follow the ISP, I suggest to extract a {{JobGraphWriter}} 
> interface from the {{JobGraphStore}}. The background is that in the future, 
> the {{Dispatcher}} does not need to recover jobs anymore and, hence, it 
> should not know the recovery related methods of the {{JobGraphStore}}. 
> Instead, the only methods it needs to know are {{putJobGraph}}, 
> {{removeJobGraph}} and {{releaseJobGraph}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948810#comment-16948810
 ] 

Till Rohrmann commented on FLINK-14347:
---

I'd be ok with this approach. Can you work on this issue [~tison]?

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Critical
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14333) YARNSessionFIFOSecuredITCase fails on Travis

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14333.
-
Resolution: Duplicate

I think this is the same problem as FLINK-14347.

> YARNSessionFIFOSecuredITCase fails on Travis
> 
>
> Key: FLINK-14333
> URL: https://issues.apache.org/jira/browse/FLINK-14333
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0
>Reporter: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
>
> https://travis-ci.org/apache/flink/builds/593958591
> {code}
> 20:59:33.116 [ERROR] Tests run: 4, Failures: 2, Errors: 0, Skipped: 2, Time 
> elapsed: 46.546 s <<< FAILURE! - in 
> org.apache.flink.yarn.YARNSessionFIFOSecuredITCase
> 20:59:33.116 [ERROR] 
> testDetachedMode(org.apache.flink.yarn.YARNSessionFIFOSecuredITCase)  Time 
> elapsed: 18.732 s  <<< FAILURE!
> java.lang.AssertionError: 
> Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-1_0/application_1570309129335_0001/container_1570309129335_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:
> [
> 2019-10-05 20:59:11,971 INFO  
> org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - 
> Shutting down
> 2019-10-05 20:59:11,999 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
> 4d21a21eee0bf5af6fc8616debddfcd4 reached globally terminal state FINISHED.
> 2019-10-05 20:59:12,026 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Stopping the JobMaster for job Streaming 
> WordCount(4d21a21eee0bf5af6fc8616debddfcd4).
> 2019-10-05 20:59:12,050 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Suspending 
> SlotPool.
> 2019-10-05 20:59:12,051 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Close ResourceManager connection 
> 8cbddf818ec206854b4ee4d75b08bf98: JobManager is shutting down..
> 2019-10-05 20:59:12,051 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Stopping 
> SlotPool.
> 2019-10-05 20:59:12,052 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Disconnect job manager 
> 0...@akka.tcp://flink@localhost:41357/user/jobmanager_0
>  for job 4d21a21eee0bf5af6fc8616debddfcd4 from the resource manager.
> 2019-10-05 20:59:12,062 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl   - 
> JobManagerRunner already shutdown.
> 2019-10-05 20:59:12,325 ERROR org.apache.flink.yarn.YarnResourceManager   
>   - Fatal error occurred in ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,330 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error 
> occurred in the cluster entrypoint.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,342 INFO  org.apache.flink.runtime.blob.BlobServer
>   - Stopped BLOB server at 0.0.0.0:34373
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14333) YARNSessionFIFOSecuredITCase fails on Travis

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14333:
--
Priority: Critical  (was: Major)

> YARNSessionFIFOSecuredITCase fails on Travis
> 
>
> Key: FLINK-14333
> URL: https://issues.apache.org/jira/browse/FLINK-14333
> Project: Flink
>  Issue Type: Task
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0
>Reporter: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
>
> https://travis-ci.org/apache/flink/builds/593958591
> {code}
> 20:59:33.116 [ERROR] Tests run: 4, Failures: 2, Errors: 0, Skipped: 2, Time 
> elapsed: 46.546 s <<< FAILURE! - in 
> org.apache.flink.yarn.YARNSessionFIFOSecuredITCase
> 20:59:33.116 [ERROR] 
> testDetachedMode(org.apache.flink.yarn.YARNSessionFIFOSecuredITCase)  Time 
> elapsed: 18.732 s  <<< FAILURE!
> java.lang.AssertionError: 
> Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-1_0/application_1570309129335_0001/container_1570309129335_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:
> [
> 2019-10-05 20:59:11,971 INFO  
> org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - 
> Shutting down
> 2019-10-05 20:59:11,999 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
> 4d21a21eee0bf5af6fc8616debddfcd4 reached globally terminal state FINISHED.
> 2019-10-05 20:59:12,026 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Stopping the JobMaster for job Streaming 
> WordCount(4d21a21eee0bf5af6fc8616debddfcd4).
> 2019-10-05 20:59:12,050 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Suspending 
> SlotPool.
> 2019-10-05 20:59:12,051 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Close ResourceManager connection 
> 8cbddf818ec206854b4ee4d75b08bf98: JobManager is shutting down..
> 2019-10-05 20:59:12,051 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Stopping 
> SlotPool.
> 2019-10-05 20:59:12,052 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Disconnect job manager 
> 0...@akka.tcp://flink@localhost:41357/user/jobmanager_0
>  for job 4d21a21eee0bf5af6fc8616debddfcd4 from the resource manager.
> 2019-10-05 20:59:12,062 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl   - 
> JobManagerRunner already shutdown.
> 2019-10-05 20:59:12,325 ERROR org.apache.flink.yarn.YarnResourceManager   
>   - Fatal error occurred in ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,330 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error 
> occurred in the cluster entrypoint.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,342 INFO  org.apache.flink.runtime.blob.BlobServer
>   - Stopped BLOB server at 0.0.0.0:34373
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14333) YARNSessionFIFOSecuredITCase fails on Travis

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14333:
--
Issue Type: Test  (was: Task)

> YARNSessionFIFOSecuredITCase fails on Travis
> 
>
> Key: FLINK-14333
> URL: https://issues.apache.org/jira/browse/FLINK-14333
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0
>Reporter: Chesnay Schepler
>Priority: Critical
>  Labels: test-stability
>
> https://travis-ci.org/apache/flink/builds/593958591
> {code}
> 20:59:33.116 [ERROR] Tests run: 4, Failures: 2, Errors: 0, Skipped: 2, Time 
> elapsed: 46.546 s <<< FAILURE! - in 
> org.apache.flink.yarn.YARNSessionFIFOSecuredITCase
> 20:59:33.116 [ERROR] 
> testDetachedMode(org.apache.flink.yarn.YARNSessionFIFOSecuredITCase)  Time 
> elapsed: 18.732 s  <<< FAILURE!
> java.lang.AssertionError: 
> Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-1_0/application_1570309129335_0001/container_1570309129335_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:
> [
> 2019-10-05 20:59:11,971 INFO  
> org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - 
> Shutting down
> 2019-10-05 20:59:11,999 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
> 4d21a21eee0bf5af6fc8616debddfcd4 reached globally terminal state FINISHED.
> 2019-10-05 20:59:12,026 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Stopping the JobMaster for job Streaming 
> WordCount(4d21a21eee0bf5af6fc8616debddfcd4).
> 2019-10-05 20:59:12,050 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Suspending 
> SlotPool.
> 2019-10-05 20:59:12,051 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Close ResourceManager connection 
> 8cbddf818ec206854b4ee4d75b08bf98: JobManager is shutting down..
> 2019-10-05 20:59:12,051 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Stopping 
> SlotPool.
> 2019-10-05 20:59:12,052 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Disconnect job manager 
> 0...@akka.tcp://flink@localhost:41357/user/jobmanager_0
>  for job 4d21a21eee0bf5af6fc8616debddfcd4 from the resource manager.
> 2019-10-05 20:59:12,062 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl   - 
> JobManagerRunner already shutdown.
> 2019-10-05 20:59:12,325 ERROR org.apache.flink.yarn.YarnResourceManager   
>   - Fatal error occurred in ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,330 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error 
> occurred in the cluster entrypoint.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,342 INFO  org.apache.flink.runtime.blob.BlobServer
>   - Stopped BLOB server at 0.0.0.0:34373
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14333) YARNSessionFIFOSecuredITCase fails on Travis

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14333:
--
Labels: test-stability  (was: )

> YARNSessionFIFOSecuredITCase fails on Travis
> 
>
> Key: FLINK-14333
> URL: https://issues.apache.org/jira/browse/FLINK-14333
> Project: Flink
>  Issue Type: Task
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0
>Reporter: Chesnay Schepler
>Priority: Major
>  Labels: test-stability
>
> https://travis-ci.org/apache/flink/builds/593958591
> {code}
> 20:59:33.116 [ERROR] Tests run: 4, Failures: 2, Errors: 0, Skipped: 2, Time 
> elapsed: 46.546 s <<< FAILURE! - in 
> org.apache.flink.yarn.YARNSessionFIFOSecuredITCase
> 20:59:33.116 [ERROR] 
> testDetachedMode(org.apache.flink.yarn.YARNSessionFIFOSecuredITCase)  Time 
> elapsed: 18.732 s  <<< FAILURE!
> java.lang.AssertionError: 
> Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-1_0/application_1570309129335_0001/container_1570309129335_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:
> [
> 2019-10-05 20:59:11,971 INFO  
> org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - 
> Shutting down
> 2019-10-05 20:59:11,999 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
> 4d21a21eee0bf5af6fc8616debddfcd4 reached globally terminal state FINISHED.
> 2019-10-05 20:59:12,026 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Stopping the JobMaster for job Streaming 
> WordCount(4d21a21eee0bf5af6fc8616debddfcd4).
> 2019-10-05 20:59:12,050 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Suspending 
> SlotPool.
> 2019-10-05 20:59:12,051 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Close ResourceManager connection 
> 8cbddf818ec206854b4ee4d75b08bf98: JobManager is shutting down..
> 2019-10-05 20:59:12,051 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Stopping 
> SlotPool.
> 2019-10-05 20:59:12,052 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Disconnect job manager 
> 0...@akka.tcp://flink@localhost:41357/user/jobmanager_0
>  for job 4d21a21eee0bf5af6fc8616debddfcd4 from the resource manager.
> 2019-10-05 20:59:12,062 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl   - 
> JobManagerRunner already shutdown.
> 2019-10-05 20:59:12,325 ERROR org.apache.flink.yarn.YarnResourceManager   
>   - Fatal error occurred in ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,330 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error 
> occurred in the cluster entrypoint.
> org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: 
> Received shutdown request from YARN ResourceManager.
>   at 
> org.apache.flink.yarn.YarnResourceManager.onShutdownRequest(YarnResourceManager.java:454)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:275)
> 2019-10-05 20:59:12,342 INFO  org.apache.flink.runtime.blob.BlobServer
>   - Stopped BLOB server at 0.0.0.0:34373
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-11632) Make TaskManager automatic bind address picking more explicit (by default) and more configurable

2019-10-10 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948807#comment-16948807
 ] 

Till Rohrmann commented on FLINK-11632:
---

Nothing which comes to my mind atm.

> Make TaskManager automatic bind address picking more explicit (by default) 
> and more configurable
> 
>
> Key: FLINK-11632
> URL: https://issues.apache.org/jira/browse/FLINK-11632
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Network
>Reporter: Alex
>Assignee: Alex
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, there is an optional {{taskmanager.host}} configuration option in 
> {{flink-conf.yaml}} that allows users of Flink to "statically" pre-define 
> what should be a bind address for TaskManager to listen on (note: it's also 
> possible to override this option by passing corresponding command line option 
> to Flink).
> In case when the option is not set, TaskManager would try [heuristically pick 
> up a bind 
> address|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L421-L442].
> The resulting address (hostname) is used to advertise different service 
> endpoints (running in TM) to the JobManager. Also it would be resolved to an 
> {{[InetAddress|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L359]}}
>  later that used as binding address for TMs inner node communication.
> This proposal is to minimize usage of heuristics (by default) by introducing 
> a new configuration option (for example, {{taskmanager.host.bind-policy}}) 
> with possible values:
>  * {{"hostname"}} - default, use TM's host's name ({{== 
> InetAddress.getLocalHost().getHostName()}};
>  * {{"ip"}} - use TM's host's ip address ({{== 
> InetAddress.getLocalHost().getHostAddress()}});
>  * {{"auto-detect-hostname"}} - use the heuristics based detection mechanism.
> *Note:* the configuration key and values could be named better and open for 
> proposals.
> *Note 2:* in the future, the configuration option _may_ require to be 
> extended to allow choosing some specific network interface, or preference of 
> ipv6 vs ipv4.
> h3. Rationale
> [The heuristics 
> mechanism|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/net/ConnectionUtils.java#L364-L475]
>  tries to establish a probe connection to {{jobmanager.rpc.address}} from 
> different network interface addresses. 
>  In case of parallel setups (when JM and multiple TMs start simultaneously, 
> in parallel), this depends on timing, assigned network ip addresses and may 
> end up with "non-uniform" address bindings of TMs (some may be "lucky" to 
> pick up non default network interface, some would fallback to 
> {{InetAddress.getLocalHost().getHostName()}}. At the end, it's less obvious 
> and transparent which binding address a TM picks up.
> In practice, it's possible that in majority of cases (in well setup 
> environments) the heuristics mechanism returns a result that matches 
> {{InetAddress.getLocalHost()}}. The proposal is to stick with this more 
> simpler and explicit binding (by default), avoiding non-determinism of 
> heuristics.
> The old mechanism is kept available, in case if it is useful in some setups. 
> But would require explicit configuration setting.
> Additionally, this proposal extends "auto configuration" option by allowing 
> users to choose the host's ip address (instead of hostname). This may be 
> convenient in situations where the TMs' machines are not necessary reachable 
> via DNS (for example in a Kubernetes setup).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14287) Decouple leader address from LeaderContender

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14287.
-
Release Note: The `LeaderElectionService#confirmLeadership(UUID, String)` 
now takes a second argument which is the address under which the leader will be 
reachable. All custom `LeaderElectionService` implementations will need to be 
updated accordingly.
  Resolution: Done

Done via 7e7ee2812f459bcfdab8862db8f6db2cab0e1368

> Decouple leader address from LeaderContender
> 
>
> Key: FLINK-14287
> URL: https://issues.apache.org/jira/browse/FLINK-14287
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> At the moment, the {{LeaderContender}} need to know the address of the leader 
> before it has gained leadership. This is problematic if one wants to decouple 
> the leader election from the actual leader component which will potentially 
> only be created after the leadership has been gained. If this is the case, 
> then it is not always possible to know the address of the actual leader 
> component before gaining the leadership.
> In order to solve this problem, I propose to change the interface of 
> {{LeaderElectionService#confirmLeadership(UUID)}} to 
> {{LeaderElectionService#confirmLeadership(UUID, String)}} where the second 
> parameter is the leader address under which the leader component is reachable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-14334) ElasticSearch docs refer to non-existent ExceptionUtils.containsThrowable

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-14334.
---
Resolution: Fixed

Fixed via

1.10.0: a5198b891da02434ceb1af17a12abf46b1e7e377
1.9.2: 2a93e90e43c4328bb1f45971227fe3ec77475c7e
1.8.3: 0281b73d9b50759def94d0350bf2ded7b910e0e3

> ElasticSearch docs refer to non-existent ExceptionUtils.containsThrowable
> -
>
> Key: FLINK-14334
> URL: https://issues.apache.org/jira/browse/FLINK-14334
> Project: Flink
>  Issue Type: Task
>  Components: Connectors / ElasticSearch, Documentation
>Affects Versions: 1.8.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.8.3, 1.9.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948728#comment-16948728
 ] 

Till Rohrmann edited comment on FLINK-14347 at 10/10/19 4:07 PM:
-

I think this problem has been introduced with FLINK-14010. I suspect that the 
{{YarnResourceManager}} receives an {{onShutdownRequest}} during the clean up 
of the test. Since we are now calling the {{FatalExceptionHandler}}, the 
process terminates and logs an exception which then fails the test.


was (Author: till.rohrmann):
I think this problem has been introduced with FLINK-14010. I suspect that the 
{{YarnResourceManager}} receives an {{onShutdownRequest}} during the clean up 
of the test. Since we are now calling the {{FatalExceptionHandler}}, the test 
process terminates which then fails the test.

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Critical
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14010) Dispatcher & JobManagers don't give up leadership when AM is shut down

2019-10-10 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948730#comment-16948730
 ] 

Till Rohrmann commented on FLINK-14010:
---

I think we introduced a Yarn test instability with this fix. FLINK-14347 looks 
as if we are reacting to an {{onShutdownRequest}} during the test clean up 
phase. Since we are calling the {{FatalExceptionHandler}} we terminate the 
process and log a fatal error in the logs which fails the test.

> Dispatcher & JobManagers don't give up leadership when AM is shut down
> --
>
> Key: FLINK-14010
> URL: https://issues.apache.org/jira/browse/FLINK-14010
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Coordination
>Affects Versions: 1.7.2, 1.8.2, 1.9.0, 1.10.0
>Reporter: Zili Chen
>Assignee: Zili Chen
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In YARN deployment scenario, YARN RM possibly launches a new AM for the job 
> even if the previous AM does not terminated, for example, when AMRM heartbeat 
> timeout. This is a common case that RM will send a shutdown request to the 
> previous AM and expect the AM shutdown properly.
> However, currently in {{YARNResourceManager}}, we handle this request in 
> {{onShutdownRequest}} which simply close the {{YARNResourceManager}} *but not 
> Dispatcher and JobManagers*. Thus, Dispatcher and JobManager launched in new 
> AM cannot be granted leadership properly. Visually,
> on previous AM: Dispatcher leader, JM leaders
> on new AM: ResourceManager leader
> since on client side or in per-job mode, JobManager address and port are 
> configured as the new AM, the whole cluster goes into an unrecoverable 
> inconsistent status: client all queries the dispatcher on new AM who is now 
> the leader. Briefly, Dispatcher and JobManagers on previous AM do not give up 
> their leadership properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948728#comment-16948728
 ] 

Till Rohrmann commented on FLINK-14347:
---

I think this problem has been introduced with FLINK-14010. I suspect that the 
{{YarnResourceManager}} receives an {{onShutdownRequest}} during the clean up 
of the test. Since we are now calling the {{FatalExceptionHandler}}, the test 
process terminates which then fails the test.

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Critical
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14347:
--
Fix Version/s: 1.8.3
   1.9.1
   1.10.0

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Critical
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14347:
--
Affects Version/s: (was: 1.8.2)
   1.8.3
   1.9.1
   1.10.0

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Major
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14347:
--
Priority: Critical  (was: Major)

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Critical
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14347:
--
Comment: was deleted

(was: Another instance: https://api.travis-ci.org/v3/job/595082243/log.txt)

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Critical
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948726#comment-16948726
 ] 

Till Rohrmann commented on FLINK-14347:
---

Another instance: https://api.travis-ci.org/v3/job/595082243/log.txt

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.10.0, 1.9.1, 1.8.3
>Reporter: Caizhi Weng
>Priority: Critical
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14347) YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14347:
--
Component/s: Deployment / YARN

> YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with 
> prohibited string
> --
>
> Key: FLINK-14347
> URL: https://issues.apache.org/jira/browse/FLINK-14347
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.8.2
>Reporter: Caizhi Weng
>Priority: Major
>
> YARNSessionFIFOITCase.checkForProhibitedLogContents fails with the following 
> exception:
> {code:java}
> 14:55:27.643 [ERROR]   
> YARNSessionFIFOITCase.checkForProhibitedLogContents:77->YarnTestBase.ensureNoProhibitedStringInLogFiles:461
>  Found a file 
> /home/travis/build/apache/flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-1_0/application_1570546069180_0001/container_1570546069180_0001_01_01/jobmanager.log
>  with a prohibited string (one of [Exception, Started 
> SelectChannelConnector@0.0.0.0:8081]). Excerpts:23760[{code}
> Travis log link: [https://travis-ci.org/apache/flink/jobs/595082243]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-10498) Decouple ExecutionGraph from JobMaster

2019-10-10 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-10498.
-
Resolution: Duplicate

> Decouple ExecutionGraph from JobMaster
> --
>
> Key: FLINK-10498
> URL: https://issues.apache.org/jira/browse/FLINK-10498
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>
> With declarative resource management we want to react to the set of available 
> resources. Thus, we need a component which is responsible for scaling the 
> {{ExecutionGraph}} accordingly. In order to better do this and separate 
> concerns, it is beneficial to introduce a {{Scheduler/ExecutionGraphDriver}} 
> component which is in charge of the {{ExecutionGraph}}. This component owns 
> the {{ExecutionGraph}} and is allowed to modify it. In the first version, 
> this component will simply accommodate all the existing logic of the 
> {{JobMaster}} and the respective {{JobMaster}} methods are forwarded to this 
> component. 
> This new component should not change the existing behaviour of Flink.
> Later this component will be in charge of announcing the required resources, 
> deciding when to rescale and executing the rescaling operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13567) Avro Confluent Schema Registry nightly end-to-end test failed on Travis

2019-10-10 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948673#comment-16948673
 ] 

Till Rohrmann commented on FLINK-13567:
---

Would it make sense to disable this test for the moment and make this issue a 
blocker for 1.10? Given the frequency of the test failure I would be in favour 
of this.

> Avro Confluent Schema Registry nightly end-to-end test failed on Travis
> ---
>
> Key: FLINK-13567
> URL: https://issues.apache.org/jira/browse/FLINK-13567
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka, Tests
>Affects Versions: 1.8.2, 1.9.0, 1.10.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
> Attachments: patch.diff
>
>
> The {{Avro Confluent Schema Registry nightly end-to-end test}} failed on 
> Travis with
> {code}
> [FAIL] 'Avro Confluent Schema Registry nightly end-to-end test' failed after 
> 2 minutes and 11 seconds! Test exited with exit code 1
> No taskexecutor daemon (pid: 29044) is running anymore on 
> travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
> No standalonesession daemon to stop on host 
> travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
> rm: cannot remove 
> '/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins':
>  No such file or directory
> {code}
> https://api.travis-ci.org/v3/job/567273939/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-13848) Support “scheduleAtFixedRate/scheduleAtFixedDelay” in RpcEndpoint#MainThreadExecutor

2019-10-09 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-13848:
-

Assignee: Biao Liu

> Support “scheduleAtFixedRate/scheduleAtFixedDelay” in 
> RpcEndpoint#MainThreadExecutor
> 
>
> Key: FLINK-13848
> URL: https://issues.apache.org/jira/browse/FLINK-13848
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Biao Liu
>Assignee: Biao Liu
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently the methods “scheduleAtFixedRate/scheduleAtFixedDelay" of 
> {{RpcEndpoint#MainThreadExecutor}} are not implemented. Because there was no 
> requirement on them before.
> Now we are planning to implement these methods to support periodic checkpoint 
> triggering.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-13904) Avoid competition between different rounds of checkpoint triggering

2019-10-09 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-13904:
-

Assignee: Biao Liu

> Avoid competition between different rounds of checkpoint triggering
> ---
>
> Key: FLINK-13904
> URL: https://issues.apache.org/jira/browse/FLINK-13904
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Biao Liu
>Assignee: Biao Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As a part of {{CheckpointCoordinator}} refactoring, I'd like to simplify the 
> concurrent triggering logic.
> The different rounds of checkpoint triggering would be processed 
> sequentially. The final target is getting rid of timer thread and 
> {{triggerLock}}.
> Note that we can't avoid all competitions of triggering for now. There is 
> still a competition between normal checkpoint triggering and savepoint 
> triggering. We could avoid this competition by executing triggering in main 
> thread. But it could not be achieved until all blocking operations are 
> handled well in IO threads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14344) Snapshot master hook state asynchronously

2019-10-09 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-14344:
-

Assignee: Biao Liu

> Snapshot master hook state asynchronously
> -
>
> Key: FLINK-14344
> URL: https://issues.apache.org/jira/browse/FLINK-14344
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Biao Liu
>Assignee: Biao Liu
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently we snapshot the master hook state synchronously. As a part of 
> reworking threading model of {{CheckpointCoordinator}}, we have to make this 
> non-blocking to satisfy the requirement of running in main thread.
> The behavior of snapshotting master hook state is similar to task state 
> snapshotting. Master state snapshotting is taken before task state 
> snapshotting. Because in master hook, there might be external system 
> initialization which task state snapshotting might depend on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-13905) Separate checkpoint triggering into stages

2019-10-09 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-13905:
-

Assignee: Biao Liu

> Separate checkpoint triggering into stages
> --
>
> Key: FLINK-13905
> URL: https://issues.apache.org/jira/browse/FLINK-13905
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Biao Liu
>Assignee: Biao Liu
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently {{CheckpointCoordinator#triggerCheckpoint}} includes some heavy IO 
> operations. We plan to separate the triggering into different stages. The IO 
> operations are executed in IO threads, while other on-memory operations are 
> not.
> This is a preparation for making all on-memory operations of 
> {{CheckpointCoordinator}} single threaded (in main thread).
> Note that we could not put on-memory operations of triggering into main 
> thread directly now. Because there are still some operations on a heavy lock 
> (coordinator-wide).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-14315) NPE with JobMaster.disconnectTaskManager

2019-10-08 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-14315.
---
Fix Version/s: (was: 1.9.1)
   1.9.2
   Resolution: Fixed

Fixed via

1.10.0: 4064b5b67d6d220e1d5518bca96688f51cbbb891
1.9.2: 8e0d1b5fac2f10f1b34ef82dcc38df8fd83cd82b
1.8.3: 766e25027a88a75002d121c8204766ee76b466d3

> NPE with JobMaster.disconnectTaskManager
> 
>
> Key: FLINK-14315
> URL: https://issues.apache.org/jira/browse/FLINK-14315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.2, 1.9.0, 1.10.0
>Reporter: Steven Zhen Wu
>Assignee: Till Rohrmann
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.8.3, 1.9.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There was some connection issue with zookeeper that caused the job to 
> restart.  But shutdown failed with this fatal NPE, which seems to cause JVM 
> to exit
> {code}
> 2019-10-02 16:16:19,134 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable 
> to read additional data from server sessionid 0x16d83374c4206f8, likely 
> server has clo
> sed socket, closing socket connection and attempting reconnect
> 2019-10-02 16:16:19,234 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: SUSPENDED
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/dispatcher no longer participates in the leader election.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- 
> http://100.122.177.82:8081 lost leadership
> 2019-10-02 16:16:19,237 INFO  
> com.netflix.spaas.runtime.resourcemanager.TitusResourceManager  - 
> ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager 
> was revoked leadershi
> p. Clearing fencing token.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_
> manager_lock.
> 2019-10-02 16:16:19,237 WARN  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not 
> monitored (tem
> porarily).
> 2019-10-02 16:16:19,238 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - JobManager 
> for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at 
> akka.tcp:
> //flink@100.122.177.82:42043/user/jobmanager_0.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 
> no longer pa
> rticipates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/jobmanager_0 no longer participates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job ksrouter 
> (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeJobMasterLeadership(JobManagerRunner.java:391)
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$revokeLeadership$5(JobManagerRunner.java:377)
> at 
> 

[jira] [Commented] (FLINK-14237) No need to rename shipped Flink jar

2019-10-08 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946782#comment-16946782
 ] 

Till Rohrmann commented on FLINK-14237:
---

Sounds good to me. Go ahead with fixing this problem.

> No need to rename shipped Flink jar
> ---
>
> Key: FLINK-14237
> URL: https://issues.apache.org/jira/browse/FLINK-14237
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Reporter: Zili Chen
>Assignee: Zili Chen
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently, when we ship Flink jar configured by -yj, we always rename it as 
> {{flink.jar}}. It seems a redundant operation since we can always use the 
> exact name of the real jar. It also causes some confusion to our users who 
> should not be required to know about Flink internal implementation that they 
> configure a specific Flink jar(said {{flink-private-version-suffix.jar}}) but 
> cannot find it on YARN container, because it is now {{flink.jar}}.
> CC [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-11632) Make TaskManager automatic bind address picking more explicit (by default) and more configurable

2019-10-08 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946780#comment-16946780
 ] 

Till Rohrmann commented on FLINK-11632:
---

The idea of the heuristic is to enable certain setups which would fail w/o it. 
This can usually be the case for misconfigured clusters as they occur in 
practice. Removing the heuristics would inflict a regression because these 
setups will no longer work. Since this feature is in Flink for quite some time 
it is hard to tell how many setups could be affected. 

> Make TaskManager automatic bind address picking more explicit (by default) 
> and more configurable
> 
>
> Key: FLINK-11632
> URL: https://issues.apache.org/jira/browse/FLINK-11632
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Network
>Reporter: Alex
>Assignee: Alex
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.8.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, there is an optional {{taskmanager.host}} configuration option in 
> {{flink-conf.yaml}} that allows users of Flink to "statically" pre-define 
> what should be a bind address for TaskManager to listen on (note: it's also 
> possible to override this option by passing corresponding command line option 
> to Flink).
> In case when the option is not set, TaskManager would try [heuristically pick 
> up a bind 
> address|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L421-L442].
> The resulting address (hostname) is used to advertise different service 
> endpoints (running in TM) to the JobManager. Also it would be resolved to an 
> {{[InetAddress|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L359]}}
>  later that used as binding address for TMs inner node communication.
> This proposal is to minimize usage of heuristics (by default) by introducing 
> a new configuration option (for example, {{taskmanager.host.bind-policy}}) 
> with possible values:
>  * {{"hostname"}} - default, use TM's host's name ({{== 
> InetAddress.getLocalHost().getHostName()}};
>  * {{"ip"}} - use TM's host's ip address ({{== 
> InetAddress.getLocalHost().getHostAddress()}});
>  * {{"auto-detect-hostname"}} - use the heuristics based detection mechanism.
> *Note:* the configuration key and values could be named better and open for 
> proposals.
> *Note 2:* in the future, the configuration option _may_ require to be 
> extended to allow choosing some specific network interface, or preference of 
> ipv6 vs ipv4.
> h3. Rationale
> [The heuristics 
> mechanism|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/net/ConnectionUtils.java#L364-L475]
>  tries to establish a probe connection to {{jobmanager.rpc.address}} from 
> different network interface addresses. 
>  In case of parallel setups (when JM and multiple TMs start simultaneously, 
> in parallel), this depends on timing, assigned network ip addresses and may 
> end up with "non-uniform" address bindings of TMs (some may be "lucky" to 
> pick up non default network interface, some would fallback to 
> {{InetAddress.getLocalHost().getHostName()}}. At the end, it's less obvious 
> and transparent which binding address a TM picks up.
> In practice, it's possible that in majority of cases (in well setup 
> environments) the heuristics mechanism returns a result that matches 
> {{InetAddress.getLocalHost()}}. The proposal is to stick with this more 
> simpler and explicit binding (by default), avoiding non-determinism of 
> heuristics.
> The old mechanism is kept available, in case if it is useful in some setups. 
> But would require explicit configuration setting.
> Additionally, this proposal extends "auto configuration" option by allowing 
> users to choose the host's ip address (instead of hostname). This may be 
> convenient in situations where the TMs' machines are not necessary reachable 
> via DNS (for example in a Kubernetes setup).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14008) Flink Scala 2.12 binary distribution contains incorrect NOTICE-binary file

2019-10-08 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946773#comment-16946773
 ] 

Till Rohrmann commented on FLINK-14008:
---

Sounds like a good idea to me [~chesnay].

> Flink Scala 2.12 binary distribution contains incorrect NOTICE-binary file
> --
>
> Key: FLINK-14008
> URL: https://issues.apache.org/jira/browse/FLINK-14008
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.8.1, 1.9.0, 1.10.0
>Reporter: Till Rohrmann
>Priority: Blocker
> Fix For: 1.10.0
>
>
> The Flink Scala {{2.12}} binary distribution contains an incorrect 
> NOTICE-binary file. The problem is that we don't update the Scala version of 
> the Scala dependencies listed in the {{NOTICE-binary}} file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14280) Introduce DispatcherRunner for better separation of concerns

2019-10-08 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14280.
-
Resolution: Done

Done via 936a317e7e51b0a03f5f12e3d4c21fedaa6d328e

> Introduce DispatcherRunner for better separation of concerns
> 
>
> Key: FLINK-14280
> URL: https://issues.apache.org/jira/browse/FLINK-14280
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In order to separate concerns (leader election, job recovery, other 
> dispatcher work) which are currently all contained in the {{Dispatcher}}, I 
> propose to add a {{DispatcherRunner}} interface which encapsulates how the 
> {{Dispatcher}} is executed. The {{DispatcherRunner}} should be used by the 
> {{DispatcherResourceManagerComponent}} instead of directly using the 
> {{Dispatcher}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14281) Introduce DispatcherRunner#getShutDownFuture

2019-10-08 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14281.
-
Resolution: Done

Done via 01ad0f6fe23615484e36b4710cd0fdc3a7276bed

> Introduce DispatcherRunner#getShutDownFuture
> 
>
> Key: FLINK-14281
> URL: https://issues.apache.org/jira/browse/FLINK-14281
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I suggest to extend the {{DispatcherRunner}} interface to return a shut down 
> future. The idea of the shut down future is that it gets completed once the 
> runner intends to be shut down by its owner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14282) Simplify DispatcherResourceManagerComponent

2019-10-08 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14282.
-
Resolution: Done

Done via 870b56418cb795edba9d4e8e282fa91d675efd7a

> Simplify DispatcherResourceManagerComponent
> ---
>
> Key: FLINK-14282
> URL: https://issues.apache.org/jira/browse/FLINK-14282
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> With the completion of the FLINK-14281 it is now possible to encapsulate the 
> shutdown logic of the {{MiniDispatcher}} within the {{DispatcherRunner}}. 
> Consequently, it is no longer necessary to have separate 
> {{DispatcherResourceManagerComponent}} implementations. I suggest to remove 
> the special case implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14284) Add shut down future to Dispatcher

2019-10-08 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14284.
-
Resolution: Done

Done via 1c5f7558626bb8c301d564c0cbc15126d6cc1731

> Add shut down future to Dispatcher
> --
>
> Key: FLINK-14284
> URL: https://issues.apache.org/jira/browse/FLINK-14284
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In order to better integrate with the {{DispatcherRunner}}, all 
> {{Dispatchers}} should report their final application status via a shut down 
> future (see FLINK-14281). Therefore, I propose to a shut down future to the 
> {{Dispatcher}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14285) Simplify Dispatcher factories by removing generics

2019-10-07 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14285.
-
Resolution: Done

Done via 282dd840fe80bd71830d1c493cdb7e1d0ad5cc6e

> Simplify Dispatcher factories by removing generics
> --
>
> Key: FLINK-14285
> URL: https://issues.apache.org/jira/browse/FLINK-14285
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{Dispatcher}} factories can be simplified by removing the generics. For 
> better maintainability of the code base, I propose to do this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14286) Remove Akka specific parsing logic from LeaderConnectionInfo

2019-10-07 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14286.
-
Resolution: Done

Done via 1082c7a470355b938ad0f7de2fa73ca5461c1bc0

> Remove Akka specific parsing logic from LeaderConnectionInfo
> 
>
> Key: FLINK-14286
> URL: https://issues.apache.org/jira/browse/FLINK-14286
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{LeaderConnectionInfo}} assumes that every leader address is an Akka 
> based address. This is not always true and unnecessarily restricts the leader 
> contender to announce an Akka address. I propose to change this by doing the 
> Akka address parsing outside of the {{LeaderConnectionInfo}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14337) HistoryServerTest.testHistoryServerIntegration failed on Travis

2019-10-07 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945975#comment-16945975
 ] 

Till Rohrmann commented on FLINK-14337:
---

The problem could have been that {{numExpectedArchivedJobs}} did not reach zero 
within the specified timeout when calling {{numExpectedArchivedJobs.await}}.

> HistoryServerTest.testHistoryServerIntegration failed on Travis
> ---
>
> Key: FLINK-14337
> URL: https://issues.apache.org/jira/browse/FLINK-14337
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.2, 1.9.0, 1.10.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
>
> The {{HistoryServerTest.testHistoryServerIntegration}} failed on Travis with
> {code}
> [ERROR] testHistoryServerIntegration[Flink version less than 1.4: 
> false](org.apache.flink.runtime.webmonitor.history.HistoryServerTest)  Time 
> elapsed: 10.667 s  <<< FAILURE!
> java.lang.AssertionError: expected:<3> but was:<2>
> {code}
> https://api.travis-ci.org/v3/job/594533358/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14337) HistoryServerTest.testHistoryServerIntegration failed on Travis

2019-10-07 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-14337:
-

 Summary: HistoryServerTest.testHistoryServerIntegration failed on 
Travis
 Key: FLINK-14337
 URL: https://issues.apache.org/jira/browse/FLINK-14337
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.0, 1.8.2, 1.10.0
Reporter: Till Rohrmann


The {{HistoryServerTest.testHistoryServerIntegration}} failed on Travis with

{code}
[ERROR] testHistoryServerIntegration[Flink version less than 1.4: 
false](org.apache.flink.runtime.webmonitor.history.HistoryServerTest)  Time 
elapsed: 10.667 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<2>
{code}

https://api.travis-ci.org/v3/job/594533358/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14224) Kafka010ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator fails on Travis

2019-10-07 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945786#comment-16945786
 ] 

Till Rohrmann commented on FLINK-14224:
---

Another instance: https://api.travis-ci.com/v3/job/242033721/log.txt

> Kafka010ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator
>  fails on Travis
> --
>
> Key: FLINK-14224
> URL: https://issues.apache.org/jira/browse/FLINK-14224
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Alex
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> The 
> {{Kafka010ProducerITCase>KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator}}
>  fails on Travis with
> {code}
> Test 
> testOneToOneAtLeastOnceCustomOperator(org.apache.flink.streaming.connectors.kafka.Kafka010ProducerITCase)
>  failed with:
> java.lang.AssertionError: Job should fail!
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnce(KafkaProducerTestBase.java:280)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testOneToOneAtLeastOnceCustomOperator(KafkaProducerTestBase.java:214)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
> https://api.travis-ci.com/v3/job/238920411/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14223) Kafka010ProducerITCase>KafkaProducerTestBase.testCustomPartitioning fails on Travis

2019-10-07 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945785#comment-16945785
 ] 

Till Rohrmann commented on FLINK-14223:
---

Another instance: https://api.travis-ci.com/v3/job/242033721/log.txt

> Kafka010ProducerITCase>KafkaProducerTestBase.testCustomPartitioning fails on 
> Travis
> ---
>
> Key: FLINK-14223
> URL: https://issues.apache.org/jira/browse/FLINK-14223
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> The {{Kafka010ProducerITCase>KafkaProducerTestBase.testCustomPartitioning}} 
> fails on Travis with
> {code}
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:627)
>   at 
> org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:78)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1505)
>   at org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:35)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testCustomPartitioning(KafkaProducerTestBase.java:188)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to update 
> metadata after 6 ms.
> {code}
> https://api.travis-ci.com/v3/job/238920411/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14316) stuck in "Job leader ... lost leadership" error

2019-10-07 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945780#comment-16945780
 ] 

Till Rohrmann commented on FLINK-14316:
---

Thanks for the information. Can you provide us with the full logs of the run 
where the problem occurred?

> stuck in "Job leader ... lost leadership" error
> ---
>
> Key: FLINK-14316
> URL: https://issues.apache.org/jira/browse/FLINK-14316
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.7.2
>Reporter: Steven Zhen Wu
>Priority: Major
>
> This is the first exception caused restart loop. Later exceptions are the 
> same. Job seems to stuck in this permanent failure state.
> {code}
> 2019-10-03 21:42:46,159 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: 
> clpevents -> device_filter -> processed_imps -> ios_processed_impression -> i
> mps_ts_assigner (449/1360) (d237f5e99b6a4a580498821473763edb) switched from 
> SCHEDULED to FAILED.
> java.lang.Exception: Job leader for job id ecb9ad9be934edf7b1a4f7b9dd6df365 
> lost leadership.
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1526)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14318) JDK11 build stalls during shading

2019-10-07 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14318:
--
Affects Version/s: 1.10.0

> JDK11 build stalls during shading
> -
>
> Key: FLINK-14318
> URL: https://issues.apache.org/jira/browse/FLINK-14318
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.10.0
>Reporter: Gary Yao
>Priority: Critical
>
> JDK11 build stalls during shading.
> Travis stage: e2d - misc - jdk11
> https://travis-ci.org/apache/flink/builds/593022581?utm_source=slack_medium=notification
> https://api.travis-ci.org/v3/job/593022629/log.txt
> Relevant excerpt from logs:
> {noformat}
> 01:53:43.889 [INFO] 
> 
> 01:53:43.889 [INFO] Building flink-metrics-reporter-prometheus-test 
> 1.10-SNAPSHOT
> 01:53:43.889 [INFO] 
> 
> ...
> 01:53:44.508 [INFO] Including 
> org.apache.flink:force-shading:jar:1.10-SNAPSHOT in the shaded jar.
> 01:53:44.508 [INFO] Excluding org.slf4j:slf4j-api:jar:1.7.15 from the shaded 
> jar.
> 01:53:44.508 [INFO] Excluding com.google.code.findbugs:jsr305:jar:1.3.9 from 
> the shaded jar.
> 01:53:44.508 [INFO] No artifact matching filter io.netty:netty
> 01:53:44.522 [INFO] Replacing original artifact with shaded artifact.
> 01:53:44.523 [INFO] Replacing 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT.jar
>  with 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-shaded.jar
> 01:53:44.524 [INFO] Replacing original test artifact with shaded test 
> artifact.
> 01:53:44.524 [INFO] Replacing 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-tests.jar
>  with 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-shaded-tests.jar
> 01:53:44.524 [INFO] Dependency-reduced POM written at: 
> /home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/dependency-reduced-pom.xml
> No output has been received in the last 10m0s, this potentially indicates a 
> stalled build or something wrong with the build itself.
> Check the details on how to adjust your build configuration on: 
> https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
> The build has been terminated
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-6053) Gauge should only take subclasses of Number, rather than everything

2019-10-04 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-6053:
-
Priority: Minor  (was: Major)

> Gauge should only take subclasses of Number, rather than everything
> --
>
> Key: FLINK-6053
> URL: https://issues.apache.org/jira/browse/FLINK-6053
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.2.0
>Reporter: Bowen Li
>Assignee: Chesnay Schepler
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, Flink's Gauge is defined as 
> ```java
> public interface Gauge extends Metric {
>   T getValue();
> }
> ```
> But it doesn't make sense to have Gauge take generic types other than Number. 
> And it blocks I from finishing FLINK-6013, because I cannot assume Gauge is 
> only about Number. So the class should be like
> ```java
> public interface Gauge extends Metric {
>   T getValue();
> }
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14315) NPE with JobMaster.disconnectTaskManager

2019-10-04 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14315:
--
Fix Version/s: 1.8.3
   1.9.1
   1.10.0

> NPE with JobMaster.disconnectTaskManager
> 
>
> Key: FLINK-14315
> URL: https://issues.apache.org/jira/browse/FLINK-14315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.9.0
>Reporter: Steven Zhen Wu
>Assignee: Till Rohrmann
>Priority: Critical
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>
> There was some connection issue with zookeeper that caused the job to 
> restart.  But shutdown failed with this fatal NPE, which seems to cause JVM 
> to exit
> {code}
> 2019-10-02 16:16:19,134 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable 
> to read additional data from server sessionid 0x16d83374c4206f8, likely 
> server has clo
> sed socket, closing socket connection and attempting reconnect
> 2019-10-02 16:16:19,234 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: SUSPENDED
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/dispatcher no longer participates in the leader election.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- 
> http://100.122.177.82:8081 lost leadership
> 2019-10-02 16:16:19,237 INFO  
> com.netflix.spaas.runtime.resourcemanager.TitusResourceManager  - 
> ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager 
> was revoked leadershi
> p. Clearing fencing token.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_
> manager_lock.
> 2019-10-02 16:16:19,237 WARN  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not 
> monitored (tem
> porarily).
> 2019-10-02 16:16:19,238 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - JobManager 
> for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at 
> akka.tcp:
> //flink@100.122.177.82:42043/user/jobmanager_0.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 
> no longer pa
> rticipates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/jobmanager_0 no longer participates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job ksrouter 
> (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeJobMasterLeadership(JobManagerRunner.java:391)
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$revokeLeadership$5(JobManagerRunner.java:377)
> at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
> at 
> java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124)
> at 
> 

[jira] [Updated] (FLINK-14315) NPE with JobMaster.disconnectTaskManager

2019-10-04 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14315:
--
Priority: Critical  (was: Major)

> NPE with JobMaster.disconnectTaskManager
> 
>
> Key: FLINK-14315
> URL: https://issues.apache.org/jira/browse/FLINK-14315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.9.0
>Reporter: Steven Zhen Wu
>Assignee: Till Rohrmann
>Priority: Critical
>
> There was some connection issue with zookeeper that caused the job to 
> restart.  But shutdown failed with this fatal NPE, which seems to cause JVM 
> to exit
> {code}
> 2019-10-02 16:16:19,134 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable 
> to read additional data from server sessionid 0x16d83374c4206f8, likely 
> server has clo
> sed socket, closing socket connection and attempting reconnect
> 2019-10-02 16:16:19,234 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: SUSPENDED
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/dispatcher no longer participates in the leader election.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- 
> http://100.122.177.82:8081 lost leadership
> 2019-10-02 16:16:19,237 INFO  
> com.netflix.spaas.runtime.resourcemanager.TitusResourceManager  - 
> ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager 
> was revoked leadershi
> p. Clearing fencing token.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_
> manager_lock.
> 2019-10-02 16:16:19,237 WARN  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not 
> monitored (tem
> porarily).
> 2019-10-02 16:16:19,238 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - JobManager 
> for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at 
> akka.tcp:
> //flink@100.122.177.82:42043/user/jobmanager_0.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 
> no longer pa
> rticipates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/jobmanager_0 no longer participates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job ksrouter 
> (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeJobMasterLeadership(JobManagerRunner.java:391)
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$revokeLeadership$5(JobManagerRunner.java:377)
> at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
> at 
> java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124)
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:374)
> at 
> 

[jira] [Updated] (FLINK-14315) NPE with JobMaster.disconnectTaskManager

2019-10-04 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14315:
--
Affects Version/s: 1.10.0
   1.8.2

> NPE with JobMaster.disconnectTaskManager
> 
>
> Key: FLINK-14315
> URL: https://issues.apache.org/jira/browse/FLINK-14315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.8.2, 1.9.0, 1.10.0
>Reporter: Steven Zhen Wu
>Assignee: Till Rohrmann
>Priority: Critical
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
>
> There was some connection issue with zookeeper that caused the job to 
> restart.  But shutdown failed with this fatal NPE, which seems to cause JVM 
> to exit
> {code}
> 2019-10-02 16:16:19,134 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable 
> to read additional data from server sessionid 0x16d83374c4206f8, likely 
> server has clo
> sed socket, closing socket connection and attempting reconnect
> 2019-10-02 16:16:19,234 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: SUSPENDED
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/dispatcher no longer participates in the leader election.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- 
> http://100.122.177.82:8081 lost leadership
> 2019-10-02 16:16:19,237 INFO  
> com.netflix.spaas.runtime.resourcemanager.TitusResourceManager  - 
> ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager 
> was revoked leadershi
> p. Clearing fencing token.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_
> manager_lock.
> 2019-10-02 16:16:19,237 WARN  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not 
> monitored (tem
> porarily).
> 2019-10-02 16:16:19,238 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - JobManager 
> for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at 
> akka.tcp:
> //flink@100.122.177.82:42043/user/jobmanager_0.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 
> no longer pa
> rticipates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/jobmanager_0 no longer participates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job ksrouter 
> (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeJobMasterLeadership(JobManagerRunner.java:391)
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$revokeLeadership$5(JobManagerRunner.java:377)
> at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
> at 
> java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124)
> at 
> 

[jira] [Assigned] (FLINK-14315) NPE with JobMaster.disconnectTaskManager

2019-10-04 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-14315:
-

Assignee: Till Rohrmann

> NPE with JobMaster.disconnectTaskManager
> 
>
> Key: FLINK-14315
> URL: https://issues.apache.org/jira/browse/FLINK-14315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.9.0
>Reporter: Steven Zhen Wu
>Assignee: Till Rohrmann
>Priority: Major
>
> There was some connection issue with zookeeper that caused the job to 
> restart.  But shutdown failed with this fatal NPE, which seems to cause JVM 
> to exit
> {code}
> 2019-10-02 16:16:19,134 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable 
> to read additional data from server sessionid 0x16d83374c4206f8, likely 
> server has clo
> sed socket, closing socket connection and attempting reconnect
> 2019-10-02 16:16:19,234 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: SUSPENDED
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/dispatcher no longer participates in the leader election.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- 
> http://100.122.177.82:8081 lost leadership
> 2019-10-02 16:16:19,237 INFO  
> com.netflix.spaas.runtime.resourcemanager.TitusResourceManager  - 
> ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager 
> was revoked leadershi
> p. Clearing fencing token.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_
> manager_lock.
> 2019-10-02 16:16:19,237 WARN  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not 
> monitored (tem
> porarily).
> 2019-10-02 16:16:19,238 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - JobManager 
> for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at 
> akka.tcp:
> //flink@100.122.177.82:42043/user/jobmanager_0.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 
> no longer pa
> rticipates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/jobmanager_0 no longer participates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job ksrouter 
> (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: JobManager is no longer the leader.
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeJobMasterLeadership(JobManagerRunner.java:391)
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$revokeLeadership$5(JobManagerRunner.java:377)
> at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
> at 
> java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124)
> at 
> org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:374)
> at 
> 

[jira] [Commented] (FLINK-14315) NPE with JobMaster.disconnectTaskManager

2019-10-04 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944427#comment-16944427
 ] 

Till Rohrmann commented on FLINK-14315:
---

Thanks for reporting this issue [~stevenz3wu]. I think your analysis is correct 
[~tison]. The problem seems to be that we first suspend the {{JobMaster}} and 
then shortly afterwards shut it down. This causes the race condition in which 
we first {{null}} the {{taskManagerHeartbeatManager}} field and the call 
{{disconnectTaskManager}} which uses this field.

This is clearly a bug and should be fixed. I suggest to do a quick fix and 
introducing a {{null}} check.

The proper fix is in my opinion to not use a {{JobMaster}} instance across 
leader sessions. Removing the mutability should solve these kind of problems. 
The issue to track this effort is FLINK-11719 which I will continue once 
FLINK-11843 has been completed.

> NPE with JobMaster.disconnectTaskManager
> 
>
> Key: FLINK-14315
> URL: https://issues.apache.org/jira/browse/FLINK-14315
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Affects Versions: 1.9.0
>Reporter: Steven Zhen Wu
>Priority: Major
>
> There was some connection issue with zookeeper that caused the job to 
> restart.  But shutdown failed with this fatal NPE, which seems to cause JVM 
> to exit
> {code}
> 2019-10-02 16:16:19,134 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Unable 
> to read additional data from server sessionid 0x16d83374c4206f8, likely 
> server has clo
> sed socket, closing socket connection and attempting reconnect
> 2019-10-02 16:16:19,234 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: SUSPENDED
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,235 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/dispatcher no longer participates in the leader election.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- 
> http://100.122.177.82:8081 lost leadership
> 2019-10-02 16:16:19,237 INFO  
> com.netflix.spaas.runtime.resourcemanager.TitusResourceManager  - 
> ResourceManager akka.tcp://flink@100.122.177.82:42043/user/resourcemanager 
> was revoked leadershi
> p. Clearing fencing token.
> 2019-10-02 16:16:19,237 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/e4e68f2b3fc40c7008cca624b2a2bab0/job_
> manager_lock.
> 2019-10-02 16:16:19,237 WARN  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not 
> monitored (tem
> porarily).
> 2019-10-02 16:16:19,238 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - JobManager 
> for job ksrouter (e4e68f2b3fc40c7008cca624b2a2bab0) was revoked leadership at 
> akka.tcp:
> //flink@100.122.177.82:42043/user/jobmanager_0.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender http://100.122.177.82:8081 
> no longer pa
> rticipates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - 
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@100.122.177.82:42043/u
> ser/jobmanager_0 no longer participates in the leader election.
> 2019-10-02 16:16:19,239 WARN  
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
> Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
> ZooKeeper.
> 2019-10-02 16:16:19,239 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job ksrouter 
> (e4e68f2b3fc40c7008cca624b2a2bab0) switched from state RUNNING to SUSPENDED.
> 

[jira] [Commented] (FLINK-14316) stuck in "Job leader ... lost leadership" error

2019-10-04 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944374#comment-16944374
 ] 

Till Rohrmann commented on FLINK-14316:
---

Thanks for reporting this issue [~stevenz3wu]. Does the same problem occur with 
Flink {{1.8}} or {{1.9}}? I'm asking because the community no longer supports 
Flink {{1.7}} and below.

> stuck in "Job leader ... lost leadership" error
> ---
>
> Key: FLINK-14316
> URL: https://issues.apache.org/jira/browse/FLINK-14316
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.7.2
>Reporter: Steven Zhen Wu
>Priority: Major
>
> This is the first exception caused restart loop. Later exceptions are the 
> same. Job seems to stuck in this permanent failure state.
> {code}
> 2019-10-03 21:42:46,159 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: 
> clpevents -> device_filter -> processed_imps -> ios_processed_impression -> i
> mps_ts_assigner (449/1360) (d237f5e99b6a4a580498821473763edb) switched from 
> SCHEDULED to FAILED.
> java.lang.Exception: Job leader for job id ecb9ad9be934edf7b1a4f7b9dd6df365 
> lost leadership.
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1526)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14299) Factor status and system metrics out of JobManagerMetricGroup

2019-10-02 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14299.
-
Resolution: Done

Done via 972576954e9e3d8d6d6c5e23a6fcad9d60b0a6c8

> Factor status and system metrics out of JobManagerMetricGroup
> -
>
> Key: FLINK-14299
> URL: https://issues.apache.org/jira/browse/FLINK-14299
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> At the moment, we use the {{JobManagerMetricGroup}} to not only register 
> {{Dispatcher}} specific metrics but also process specific metrics such as 
> CPU, threads, memory, etc. Due to this fact, it is not possible to close the 
> {{JobManagerMetricGroup}} when the life time of the {{Dispatcher}} 
> terminates. In order to do this, I suggest to introduce a new 
> {{ProcessMetricGroup}} which is used to register the process specific 
> metrics. 
> In order to guarantee backwards compatibility, I suggest to use the same 
> scope format as {{SCOPE_NAMING_JM}} and then appending {{.Status}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-14303) Factor resource manager related metrics out of JobManagerMetricGroup

2019-10-02 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942932#comment-16942932
 ] 

Till Rohrmann edited comment on FLINK-14303 at 10/2/19 4:02 PM:


Done via 

7eda4fc0be962d8af5d87c4120a4b5a2a30b9c11
a682254dbdbe4defa3560ec949401622d93a0a83


was (Author: till.rohrmann):
Done via 7eda4fc0be962d8af5d87c4120a4b5a2a30b9c11

> Factor resource manager related metrics out of JobManagerMetricGroup
> 
>
> Key: FLINK-14303
> URL: https://issues.apache.org/jira/browse/FLINK-14303
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> At the moment, we use the {{JobManagerMetricGroup}} to report resource 
> manager specific metrics such as the number of slots or registered 
> {{TaskExecutors}}. Since the {{JobManagerMetricGroup}} is shared between the 
> {{ResourceManager}} and the {{Dispatcher}} this prevents closing the 
> {{JobManagerMetricGroup}} from the {{Dispatcher}}. 
> In order to solve this problem, I propose to introduce a 
> {{ResourceManagerMetricGroup}} which is used by the {{ResourceManager}} to 
> register resource manager specific metrics. In order to not break existing 
> third party applications which rely on the current metrics layout, the 
> {{ResourceManagerMetricGroup}} will use the scope and group name of the 
> {{JobManagerMetricGroup}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14303) Factor resource manager related metrics out of JobManagerMetricGroup

2019-10-02 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14303.
-
Resolution: Done

Done via 7eda4fc0be962d8af5d87c4120a4b5a2a30b9c11

> Factor resource manager related metrics out of JobManagerMetricGroup
> 
>
> Key: FLINK-14303
> URL: https://issues.apache.org/jira/browse/FLINK-14303
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> At the moment, we use the {{JobManagerMetricGroup}} to report resource 
> manager specific metrics such as the number of slots or registered 
> {{TaskExecutors}}. Since the {{JobManagerMetricGroup}} is shared between the 
> {{ResourceManager}} and the {{Dispatcher}} this prevents closing the 
> {{JobManagerMetricGroup}} from the {{Dispatcher}}. 
> In order to solve this problem, I propose to introduce a 
> {{ResourceManagerMetricGroup}} which is used by the {{ResourceManager}} to 
> register resource manager specific metrics. In order to not break existing 
> third party applications which rely on the current metrics layout, the 
> {{ResourceManagerMetricGroup}} will use the scope and group name of the 
> {{JobManagerMetricGroup}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14305) Move ownership of JobManagerMetricGroup to Dispatcher

2019-10-02 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-14305.
-
Resolution: Done

Done via 52e5d7c4fa479adbb48cf83a9f09d78978ddd8b7

> Move ownership of JobManagerMetricGroup to Dispatcher
> -
>
> Key: FLINK-14305
> URL: https://issues.apache.org/jira/browse/FLINK-14305
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With FLINK-14303 and FLINK-14299, it is now possible to move the ownership of 
> the {{JobManagerMetricGroup}} into the {{Dispatcher}}. This makes the 
> lifespan of the metric group shorter and the lifecycle management easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14237) No need to rename shipped Flink jar

2019-10-02 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942670#comment-16942670
 ] 

Till Rohrmann commented on FLINK-14237:
---

I'm not entirely sure why this was done. It could perfectly be that there is no 
particular reason for it. If this is the case, then we can change it. But we 
need first to check [~tison].

> No need to rename shipped Flink jar
> ---
>
> Key: FLINK-14237
> URL: https://issues.apache.org/jira/browse/FLINK-14237
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Reporter: Zili Chen
>Assignee: Zili Chen
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently, when we ship Flink jar configured by -yj, we always rename it as 
> {{flink.jar}}. It seems a redundant operation since we can always use the 
> exact name of the real jar. It also causes some confusion to our users who 
> should not be required to know about Flink internal implementation that they 
> configure a specific Flink jar(said {{flink-private-version-suffix.jar}}) but 
> cannot find it on YARN container, because it is now {{flink.jar}}.
> CC [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14271) Deprecate legacy RestartPipelinedRegionStrategy

2019-10-02 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942657#comment-16942657
 ] 

Till Rohrmann commented on FLINK-14271:
---

[~zhuzh] as long as we have users we cannot simply remove it and should rather 
deprecate it first. If [~stevenz3wu] is fine with using the properly working 
{{AdaptedRestartPipelinedRegionStrategyNG}}, then we could remove it since the 
old strategy is basically broken.

> Deprecate legacy RestartPipelinedRegionStrategy
> ---
>
> Key: FLINK-14271
> URL: https://issues.apache.org/jira/browse/FLINK-14271
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: Zhu Zhu
>Priority: Minor
> Fix For: 1.10.0
>
>
> The legacy {{RestartPipelinedRegionStrategy}} has been superseded by 
> {{AdaptedRestartPipelinedRegionStrategyNG}} in Flink 1.9.
> It heavily depends on ExecutionGraph components and becomes a blocker for a 
> clean scheduler re-architecture.
> We should deprecate it for further removal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14306) flink-python build fails with No module named pkg_resources

2019-10-02 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-14306:
--
Component/s: Build System

> flink-python build fails with No module named pkg_resources
> ---
>
> Key: FLINK-14306
> URL: https://issues.apache.org/jira/browse/FLINK-14306
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python, Build System
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Critical
> Fix For: 1.10.0
>
>
> [Benchmark 
> builds|http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/4576/console]
>  started to fail with
> {noformat}
> [INFO] Adding generated sources (java): 
> /home/jenkins/workspace/flink-master-benchmarks/flink/flink-python/target/generated-sources
> [INFO] 
> [INFO] --- exec-maven-plugin:1.5.0:exec (Protos Generation) @ 
> flink-python_2.11 ---
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/flink-master-benchmarks/flink/flink-python/pyflink/gen_protos.py",
>  line 33, in 
> import pkg_resources
> ImportError: No module named pkg_resources
> [ERROR] Command execution failed.
> (...)
> [INFO] flink-state-processor-api .. SUCCESS [  0.299 
> s]
> [INFO] flink-python ... FAILURE [  0.434 
> s]
> [INFO] flink-scala-shell .. SKIPPED
> {noformat}
> because of this ticket: https://issues.apache.org/jira/browse/FLINK-14018
> I think I can solve the benchmark builds failing quite easily by installing 
> {{setuptools}} python package, so this ticket is not about this, but about 
> deciding how should we treat such kind of external dependencies. I don't see 
> this dependency being mentioned anywhere in the documentation ([for example 
> here|https://ci.apache.org/projects/flink/flink-docs-stable/flinkDev/building.html]).
> Probably at the very least those external dependencies should be documented, 
> but also I fear about such kind of manual steps to do before building the 
> Flink can become a problem if grow out of control. Some questions:
> # Do we really need this dependency?
> # Could this dependency be resolve automatically? By installing into a local 
> python virtual environment?
> # Should we document those dependencies somewhere?
> # Maybe we should not build flink-python by default?
> # Maybe we should add a pre-build script for flink-python to verify the 
> dependencies and to throw an easy to understand error with hint how to fix it?
> CC [~hequn] [~dian.fu] [~trohrmann] [~jincheng]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   4   5   6   7   8   9   10   >