[jira] [Commented] (FLINK-6293) Flakey JobManagerITCase

2017-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988798#comment-15988798
 ] 

ASF GitHub Bot commented on FLINK-6293:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/3796


> Flakey JobManagerITCase
> ---
>
> Key: FLINK-6293
> URL: https://issues.apache.org/jira/browse/FLINK-6293
> Project: Flink
> Issue Type: Bug
> Components: JobManager, Tests
> Affects Versions: 1.3.0
> Reporter: Nico Kruber
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: test-stability
> Fix For: 1.3.0
>
>
> Quite rarely, {{JobManagerITCase}} seems to hang, e.g. see 
> https://api.travis-ci.org/jobs/220888193/log.txt?deansi=true 
> The maven watchdog kills the build because no output is produced within 
> 300s, and {{JobManagerITCase}} seems to hang in line 772, i.e.
> {code:title=JobManagerITCase lines 770-772|language=java|linenumbers=true|firstline=770}
> // Trigger savepoint for non-existing job
> jobManager.tell(TriggerSavepoint(jobId, Option.apply("any")), testActor)
> val response = expectMsgType[TriggerSavepointFailure](deadline.timeLeft)
> {code}
> Although the (downloaded) logs do not quite allow a precise mapping to this 
> test case, it looks as if the following block may be related:
> {code}
> 09:34:47,684 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster  - Akka ask timeout set to 100s
> 09:34:47,777 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster  - Disabled queryable state server
> 09:34:47,777 INFO  org.apache.flink.runtime.minicluster.FlinkMiniCluster  - Starting FlinkMiniCluster.
> 09:34:47,809 INFO  akka.event.slf4j.Slf4jLogger  - Slf4jLogger started
> 09:34:47,837 INFO  org.apache.flink.runtime.blob.BlobServer  - Created BLOB server storage directory /tmp/blobStore-eab23d04-ea18-4dc5-b1df-fcf9fc295062
> 09:34:47,838 WARN  org.apache.flink.runtime.net.SSLUtils  - Not a SSL socket, will skip setting tls version and cipher suites.
> 09:34:47,839 INFO  org.apache.flink.runtime.blob.BlobServer  - Started BLOB server at 0.0.0.0:36745 - max concurrent requests: 50 - max backlog: 1000
> 09:34:47,840 INFO  org.apache.flink.runtime.metrics.MetricRegistry  - No metrics reporter configured, no metrics will be exposed/reported.
> 09:34:47,850 INFO  org.apache.flink.runtime.testingUtils.TestingMemoryArchivist  - Started memory archivist akka://flink/user/archive_1
> 09:34:47,860 INFO  org.apache.flink.runtime.testutils.TestingResourceManager  - Trying to associate with JobManager leader akka://flink/user/jobmanager_1
> 09:34:47,861 INFO  org.apache.flink.runtime.testingUtils.TestingJobManager  - Starting JobManager at akka://flink/user/jobmanager_1.
> 09:34:47,862 WARN  org.apache.flink.runtime.testingUtils.TestingJobManager  - Discard message LeaderSessionMessage(----,TriggerSavepoint(6e813070338a23b0ff571646bca56521,Some(any))) because there is currently no valid leader id known.
> 09:34:47,862 INFO  org.apache.flink.runtime.testingUtils.TestingJobManager  - JobManager akka://flink/user/jobmanager_1 was granted leadership with leader session ID Some(----).
> 09:34:47,867 INFO  org.apache.flink.runtime.testutils.TestingResourceManager  - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager_1#-652927556] - leader session ----
> {code}
> If so, then this may be related to FLINK-6287 and may possibly even be a 
> duplicate.
> What is strange though is that the timeout for the expected message to arrive 
> is no more than 2m and thus the test should properly fail within 300s.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-6293) Flakey JobManagerITCase

2017-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988656#comment-15988656
 ] 

ASF GitHub Bot commented on FLINK-6293:
---

Github user uce commented on the issue:

https://github.com/apache/flink/pull/3796
  
Good fix. +1 to merge.


> Flakey JobManagerITCase
> ---
>
> Key: FLINK-6293
> URL: https://issues.apache.org/jira/browse/FLINK-6293
> Project: Flink
> Issue Type: Bug
> Components: JobManager, Tests
> Affects Versions: 1.3.0
> Reporter: Nico Kruber
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: test-stability


[jira] [Commented] (FLINK-6293) Flakey JobManagerITCase

2017-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988394#comment-15988394
 ] 

ASF GitHub Bot commented on FLINK-6293:
---

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/3796

[FLINK-6293] [tests] Harden JobManagerITCase

One of the unit tests in JobManagerITCase starts a MiniCluster and sends a
LeaderSessionMessage to the JobManager without waiting until the JobManager
has gained leadership. This can lead to a dropped TriggerSavepoint message
which will cause the test to deadlock.

This PR fixes the problem by explicitly waiting for the JobManager to become
the leader.
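
For illustration only, here is a minimal Java sketch of the idea behind the fix. It is not the code from this pull request and does not use Flink or Akka; `ToyJobManager`, `grantLeadership`, `tell`, and `leadershipFuture` are hypothetical names. The toy receiver discards every message until leadership has been granted, so a sender that does not wait can lose its request, while a sender that first waits for the leadership signal cannot.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class WaitForLeaderSketch {

    /** Hypothetical stand-in for a JobManager that drops messages until it is leader. */
    static class ToyJobManager {
        private boolean leader = false;
        private final CompletableFuture<Void> leadership = new CompletableFuture<>();

        /** Marks this toy JobManager as leader and signals anyone waiting for that. */
        synchronized void grantLeadership() {
            leader = true;
            leadership.complete(null);
        }

        /** Returns true if the message was accepted, false if it was discarded. */
        synchronized boolean tell(String message) {
            // Before leadership is granted the message is silently dropped,
            // similar in spirit to the "Discard message ..." log line in the issue.
            return leader;
        }

        CompletableFuture<Void> leadershipFuture() {
            return leadership;
        }
    }

    /** Sends immediately; leadership is granted only afterwards, so the message is lost. */
    static boolean sendWithoutWaiting() {
        ToyJobManager jm = new ToyJobManager();
        boolean accepted = jm.tell("TriggerSavepoint");
        jm.grantLeadership();
        return accepted;
    }

    /** Waits (with a bound) for leadership before sending, so the message is accepted. */
    static boolean sendAfterWaiting() {
        ToyJobManager jm = new ToyJobManager();
        Thread granter = new Thread(jm::grantLeadership);
        granter.start();
        try {
            jm.leadershipFuture().get(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("leadership was not granted in time", e);
        }
        return jm.tell("TriggerSavepoint");
    }

    public static void main(String[] args) {
        System.out.println("without waiting, accepted = " + sendWithoutWaiting());
        System.out.println("after waiting,   accepted = " + sendAfterWaiting());
    }
}
```

The wait is bounded on purpose: if leadership is never granted, the sketch fails with an exception instead of hanging.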

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tillrohrmann/flink fixJobManagerITCase

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/3796.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3796


commit 5abf141c489154f1fc5650a27b0eb19dbaa29e75
Author: Till Rohrmann 
Date:   2017-04-28T08:04:57Z

[FLINK-6293] [tests] Harden JobManagerITCase





> Flakey JobManagerITCase
> ---
>
> Key: FLINK-6293
> URL: https://issues.apache.org/jira/browse/FLINK-6293
> Project: Flink
> Issue Type: Bug
> Components: JobManager, Tests
> Affects Versions: 1.3.0
> Reporter: Nico Kruber
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: test-stability

[jira] [Commented] (FLINK-6293) Flakey JobManagerITCase

2017-04-28 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988390#comment-15988390
 ] 

Till Rohrmann commented on FLINK-6293:
--

I think the problem is that in the "handle trigger savepoint response for 
non-existing job" test, we retrieve the leader gateway but do not wait until 
the JobManager has gained leadership. This is possible when using the 
standalone leader retrieval service. As a consequence, we can end up sending 
the {{TriggerSavepoint}} message too early, i.e. before the JobManager has 
gained leadership, in which case the message is dropped.
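
To make the failure mode concrete, here is a small, self-contained Java sketch. It is an illustration under assumptions, not the test code: `handle` and `requestAndAwait` are hypothetical helpers, and a `BlockingQueue` stands in for the test actor's mailbox. A receiver that is not yet leader drops the request without replying, so the caller's bounded wait for a reply ends empty-handed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class DroppedRequestSketch {

    /** Hypothetical receiver: replies only when it is leader, otherwise drops the request. */
    static void handle(boolean isLeader, String request, BlockingQueue<String> replyTo) {
        if (isLeader) {
            // Stand-in for the failure reply expected for a non-existing job.
            replyTo.add(request + "Failure");
        }
        // Not leader: the request is discarded and no reply is ever sent.
    }

    /** Sends one request and waits up to timeoutMillis for a reply; null means timeout. */
    static String requestAndAwait(boolean isLeader, long timeoutMillis) {
        BlockingQueue<String> replies = new LinkedBlockingQueue<>();
        handle(isLeader, "TriggerSavepoint", replies);
        try {
            return replies.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("leader:     " + requestAndAwait(true, 200));
        System.out.println("not leader: " + requestAndAwait(false, 200));
    }
}
```

In the sketch the caller waits only 200 ms. With a much longer timeout, the same dropped request would look like a hang.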

> Flakey JobManagerITCase
> ---
>
> Key: FLINK-6293
> URL: https://issues.apache.org/jira/browse/FLINK-6293
> Project: Flink
> Issue Type: Bug
> Components: JobManager, Tests
> Affects Versions: 1.3.0
> Reporter: Nico Kruber
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: test-stability


[jira] [Commented] (FLINK-6293) Flakey JobManagerITCase

2017-04-26 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985432#comment-15985432
 ] 

Stephan Ewen commented on FLINK-6293:
-

Hitting this frequently on local builds as well:

{code}
Tests run: 21, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1,220.166 sec <<< FAILURE! - in org.apache.flink.runtime.jobmanager.JobManagerITCase
The JobManager actor must handle trigger savepoint response for non-existing job(org.apache.flink.runtime.jobmanager.JobManagerITCase)  Time elapsed: 1,199.316 sec  <<< FAILURE!
java.lang.AssertionError: assertion failed: timeout (1199213200030 nanoseconds) during expectMsgClass waiting for class org.apache.flink.runtime.messages.JobManagerMessages$TriggerSavepointFailure
    at scala.Predef$.assert(Predef.scala:179)
    at akka.testkit.TestKitBase$class.expectMsgClass_internal(TestKit.scala:423)
    at akka.testkit.TestKitBase$class.expectMsgType(TestKit.scala:405)
    at akka.testkit.TestKit.expectMsgType(TestKit.scala:718)
    at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34$$anonfun$apply$mcV$sp$35.apply$mcV$sp(JobManagerITCase.scala:772)
    at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34$$anonfun$apply$mcV$sp$35.apply(JobManagerITCase.scala:764)
    at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34$$anonfun$apply$mcV$sp$35.apply(JobManagerITCase.scala:764)
    at akka.testkit.TestKitBase$class.within(TestKit.scala:296)
    at akka.testkit.TestKit.within(TestKit.scala:718)
    at akka.testkit.TestKitBase$class.within(TestKit.scala:310)
    at akka.testkit.TestKit.within(TestKit.scala:718)
    at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34.apply$mcV$sp(JobManagerITCase.scala:764)
    at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34.apply(JobManagerITCase.scala:758)
    at org.apache.flink.runtime.jobmanager.JobManagerITCase$$anonfun$1$$anonfun$apply$mcV$sp$34.apply(JobManagerITCase.scala:758)
    at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    at org.scalatest.Transformer.apply(Transformer.scala:22)
    at org.scalatest.Transformer.apply(Transformer.scala:20)
    at org.scalatest.WordSpecLike$$anon$1.apply(WordSpecLike.scala:953)
    at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
    at org.apache.flink.runtime.jobmanager.JobManagerITCase.withFixture(JobManagerITCase.scala:50)

{code}
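
The timeout in the assertion message is given in nanoseconds. As a quick sanity check, a small Java sketch of the conversion follows (the class and constant names are illustrative only). The reported value comes to roughly 20 minutes, which is well above the 300s maven watchdog limit mentioned in the issue description. That may be why CI builds get killed before the test itself fails, though this is an inference and not something the logs confirm.

```java
import java.time.Duration;

class TimeoutArithmetic {

    // Value copied from the assertion message in the stack trace above.
    static final long REPORTED_TIMEOUT_NANOS = 1_199_213_200_030L;

    // Watchdog limit as stated in the issue description.
    static final Duration WATCHDOG_LIMIT = Duration.ofSeconds(300);

    static Duration reportedTimeout() {
        return Duration.ofNanos(REPORTED_TIMEOUT_NANOS);
    }

    static boolean exceedsWatchdog() {
        return reportedTimeout().compareTo(WATCHDOG_LIMIT) > 0;
    }

    public static void main(String[] args) {
        Duration timeout = reportedTimeout();
        System.out.println("whole seconds: " + timeout.getSeconds());
        System.out.println("whole minutes: " + timeout.toMinutes());
        System.out.println("exceeds 300s watchdog: " + exceedsWatchdog());
    }
}
```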

> Flakey JobManagerITCase
> ---
>
> Key: FLINK-6293
> URL: https://issues.apache.org/jira/browse/FLINK-6293
> Project: Flink
> Issue Type: Bug
> Components: JobManager, Tests
> Affects Versions: 1.3.0
> Reporter: Nico Kruber
> Priority: Critical
> Labels: test-stability

[jira] [Commented] (FLINK-6293) Flakey JobManagerITCase

2017-04-20 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976323#comment-15976323
 ] 

Stephan Ewen commented on FLINK-6293:
-

Another failed instance: 
https://s3.amazonaws.com/archive.travis-ci.org/jobs/223677152/log.txt

> Flakey JobManagerITCase
> ---
>
> Key: FLINK-6293
> URL: https://issues.apache.org/jira/browse/FLINK-6293
> Project: Flink
> Issue Type: Bug
> Components: JobManager, Tests
> Affects Versions: 1.3.0
> Reporter: Nico Kruber
> Labels: test-stability


[jira] [Commented] (FLINK-6293) Flakey JobManagerITCase

2017-04-11 Thread Nico Kruber (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964138#comment-15964138
 ] 

Nico Kruber commented on FLINK-6293:


Same here (with only the {{transfer.sh}} upload changed compared to master)
https://s3.amazonaws.com/archive.travis-ci.org/jobs/220888197/log.txt

> Flakey JobManagerITCase
> ---
>
> Key: FLINK-6293
> URL: https://issues.apache.org/jira/browse/FLINK-6293
> Project: Flink
> Issue Type: Bug
> Components: Job-Submission, Tests
> Affects Versions: 1.3.0
> Reporter: Nico Kruber