[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-03-31 Thread Yu Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071866#comment-17071866
 ] 

Yu Li commented on FLINK-16018:
---

Thanks for the followup and quick fix [~trohrmann]!

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Robert Metzger
>Assignee: Till Rohrmann
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.3, 1.10.1, 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-03-30 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070839#comment-17070839
 ] 

Till Rohrmann commented on FLINK-16018:
---

As stated before, the underlying problem is that we wait for the {{JobManager}} 
creation before acknowledging the job submission. Hence, the proper fix would 
be to make the job submission "non-blocking" (technically is already non 
blocking but the response is blocked). 

Since this effort is a bit bigger I would suggest to do the following:

* As part of this issue and as a quick fix, we increase {{web.timeout}} to 10 
minutes.
* I'll create a follow up issue to make the job submission non-blocking which 
we try to fix asap





> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Robert Metzger
>Assignee: Till Rohrmann
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-03-29 Thread Yu Li (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070667#comment-17070667
 ] 

Yu Li commented on FLINK-16018:
---

Hi folks, any update on this one? Asking since it's marked as a blocker for 
1.10.1 but seems to be stalled. Thanks.

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Robert Metzger
>Assignee: Till Rohrmann
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-03-09 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055566#comment-17055566
 ] 

Zili Chen commented on FLINK-16018:
---

A general thought is we respect {{timeout}} parameter in 
{{Dispatcher#submitJob}}, having a field that helps determine the progress, and 
complete the future on Timeout with that field(stringified in 
{{JobSubmissionException}}).

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Robert Metzger
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-03-09 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054903#comment-17054903
 ] 

Stephan Ewen commented on FLINK-16018:
--

Saw another occurrence, raising this to BLOCKER.

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Robert Metzger
>Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-03-06 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053579#comment-17053579
 ] 

Zili Chen commented on FLINK-16018:
---

[~trohrmann] We internally return immediately on {{JobManagerRunnerFuture}} 
created instead of the {{JobManagerRunner}} created, and we regard the latter 
part of job execution. Internally we do it as a follow up of FLINK-10333 
because it is included in a whole story we properly handle job status in 
{{Dispatcher}} and required correctly manipulation of {{JobGraphStore}} & 
{{JobRegistry}}.

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Critical
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-03-06 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053557#comment-17053557
 ] 

Robert Metzger commented on FLINK-16018:


I'm increasing the priority of this ticket to Critical (feel free to disagree 
with me)

Another user opened a ticket (FLINK-16429), and there's a similar report on the 
user list: 
https://lists.apache.org/thread.html/r05bd8e13cd7d1b2198bd9e275ff6946af428d4480f75ff5a54200825%40%3Cuser.flink.apache.org%3E

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-02-18 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039244#comment-17039244
 ] 

Till Rohrmann commented on FLINK-16018:
---

A small addition: The creation of the {{JobManager}} is not blocking the main 
thread of the {{Dispatcher}}. The problem is that we only complete the future 
returned by {{submitJob}} after the {{JobManager}} has been created and which 
takes longer than the timeout on the {{RestServer}}.

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-02-18 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039228#comment-17039228
 ] 

Till Rohrmann commented on FLINK-16018:
---

I agree that {{TimeoutExceptions}} should not surface to the user. They are 
hard to understand and cryptic.

I also agree that the root problem is the blocking operation which we should 
run in a separate thread(pool).

Apart from these two aspects, I actually agree with Andrey that we have a 
problem of un-aligned timeouts here. The {{AskTimeoutException}} the user sees 
is caused by the {{JobSubmitHandler}} which waits on the 
{{DisptacherGateway.submitJob}} with a timeout of 10s (default value). The 
{{JobManager}} creation depends on the Amazon SDK client which, I assume, does 
some retries and delays in between. Since the 10s are smaller than the overall 
SDK client timeout, we only see the {{AskTimeoutException}}. This is also 
underlined by Robert's post where he set {{web.timeout}} to 30s and then sees 
the actual cause of failure.

Retrying the operation in case of a timeout is at the moment not an option 
since the submit operation is not idempotent. Also, 
{{DispatcherGateway.submitJob}} currently requires the creation of the 
{{ExecutionGraph}} before returning. One idea could be to not wait for the 
creation before acknowledging the job submission. One could define the job 
submission as that all files have been uploaded to the cluster. This is 
actually something I wanted to change with FLINK-11719.

Another thought could be to say that we require the {{Dispatcher}} to always 
respond eventually. If this assumption holds, then we could set the 
{{web.timeout}} to infinity. We do this at other places where we now that the 
actors run in the same actor system as well. As long as the rest server and the 
cluster entrypoint run in the same JVM, this assumption might be ok. Of course, 
the consequence of such a change would be that any kind of (dead/live)locks in 
downstream components would propagate through to the {{RestServer}}. Maybe 
setting {{web.timeout}} to 10 minutes would then be a compromise.

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-02-14 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036957#comment-17036957
 ] 

Stephan Ewen commented on FLINK-16018:
--

I don't think this is a misalignment of timeouts. Indeed, when an Akka ask 
timeout propagates to the user, it means we are not handling failures correctly 
and simply report back whatever happened in the RPC system.

In this specific issue, we also have the problem of blocking/synchronous 
ExecutionGraph creation. Because some amount of synchronous initialization will 
most likely always be there, we would need some better way to handle responses 
to the client. Maybe an additional JobStatus (like INITIALIZING).

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-02-14 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036944#comment-17036944
 ] 

Robert Metzger commented on FLINK-16018:


Thank you for looking at the ticket. I had an offline chat with [~sewen]. He 
asked me to open this ticket, because there should never be the a case where we 
show an akka ask timeout to the user.

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-02-14 Thread Andrey Zagrebin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036904#comment-17036904
 ] 

Andrey Zagrebin commented on FLINK-16018:
-

In this particular case, it was just un-alignment of timeouts which is hard to 
get between framework and custom operation timeouts.

So basically, it can always happen unless submission operation asynchronously 
monitors some general response timeout for each operation invoked during the 
submission then it can report where it got stuck.

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> --
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job 
> - a unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with a AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [1 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16018) Improve error reporting when submitting batch job (instead of AskTimeoutException)

2020-02-12 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035325#comment-17035325
 ] 

Robert Metzger commented on FLINK-16018:


Setting the config to {{web.timeout: 30}} reveals the real underlying issue:
{code}
2020-02-07T15:59:57.2209501Z 2020-02-07 15:59:50,547 ERROR 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Failed to 
submit job 115b0668417af408b4a129499c634396.
2020-02-07T15:59:57.2209618Z java.lang.RuntimeException: 
org.apache.flink.runtime.client.JobExecutionException: Could not set up 
JobManager
2020-02-07T15:59:57.2209725Zat 
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
2020-02-07T15:59:57.2209812Zat 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
2020-02-07T15:59:57.2209908Zat 
akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
2020-02-07T15:59:57.2209993Zat 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
2020-02-07T15:59:57.2210098Zat 
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2020-02-07T15:59:57.2210177Zat 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
2020-02-07T15:59:57.2210274Zat 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2020-02-07T15:59:57.2210433Zat 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-02-07T15:59:57.2210542Z Caused by: 
org.apache.flink.runtime.client.JobExecutionException: Could not set up 
JobManager
2020-02-07T15:59:57.2210631Zat 
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:152)
2020-02-07T15:59:57.2210744Zat 
org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
2020-02-07T15:59:57.2210844Zat 
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379)
2020-02-07T15:59:57.2210945Zat 
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
2020-02-07T15:59:57.2211019Z... 7 more
2020-02-07T15:59:57.2211550Z Caused by: 
org.apache.flink.runtime.client.JobExecutionException: Cannot initialize task 
'DataSink (CsvOutputFormat (path: 
s3://test-data/temp/test_batch_wordcount-0205c494-01da-4cde-ae74-1925833efb57, 
delimiter:  ))': doesBucketExist on test-data: 
com.amazonaws.SdkClientException: Unable to execute HTTP request: minio: Unable 
to execute HTTP request: minio
2020-02-07T15:59:57.2211736Zat 
org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:216)
2020-02-07T15:59:57.2211831Zat 
org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:253)
2020-02-07T15:59:57.2211942Zat 
org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:225)
2020-02-07T15:59:57.2212028Zat 
org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:213)
2020-02-07T15:59:57.2212127Zat 
org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:117)
2020-02-07T15:59:57.2212218Zat 
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
2020-02-07T15:59:57.2212330Zat 
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
2020-02-07T15:59:57.2212497Zat 
org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:266)
2020-02-07T15:59:57.2212592Zat 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
2020-02-07T15:59:57.2212714Zat 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
2020-02-07T15:59:57.2212815Zat 
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:146)
2020-02-07T15:59:57.2212902Z... 10 more
2020-02-07T15:59:57.2213257Z Caused by: 
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on test-data: 
com.amazonaws.SdkClientException: Unable to execute HTTP request: minio: Unable 
to execute HTTP request: minio
2020-02-07T15:59:57.2213378Zat 
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:177)
2020-02-07T15:59:57.2213464Zat 
org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
2020-02-07T15:59:57.2213565Zat 
org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:260)
2020-02-07T15:59:57.2213641Zat 
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
2020-02-07T15:59:57.2213731Zat 
org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
2020-02-07T15:59:57.2213911Zat 
org.ap