[jira] [Commented] (FLINK-12382) HA + ResourceManager exception: Fencing token not set

2022-01-19 Thread Henrik (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478579#comment-17478579
 ] 

Henrik commented on FLINK-12382:


Excellent, thank you!




> HA + ResourceManager exception: Fencing token not set
> -
>
> Key: FLINK-12382
> URL: https://issues.apache.org/jira/browse/FLINK-12382
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as all previous bugs filed by myself, today, but 
> this time with HA via zetcd.
>Reporter: Henrik
>Priority: Major
> Attachments: jobmanager_log, jobmanager_stdout
>
>
> I'm testing zetcd + session jobs in k8s, checking what happens when I kill 
> both the job-cluster and task-manager at the same time while keeping 
> ZK/zetcd up and running.
> Then I get this stack trace, which is completely non-actionable for me and 
> also resolves itself. I expect a number of retries, and if this exception is 
> part of the protocol's retry signalling, it should not be printed as a log 
> entry.
> This might be related to an older bug: 
> [https://jira.apache.org/jira/browse/FLINK-7734]
> {code:java}
> [tm] 2019-05-01 11:32:01,641 ERROR 
> org.apache.flink.runtime.taskexecutor.TaskExecutor    - Registration 
> at ResourceManager failed due to an error
> [tm] java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
> not set: Ignoring message 
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
> HardwareDescription, Time))) sent to 
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
> token is null.
> [tm]     at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> [tm]     at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> [tm]     at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> [tm]     at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> [tm]     at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [tm]     at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> [tm]     at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> [tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> [tm]     at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> [tm]     at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> [tm]     at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> [tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> [tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
> [tm]     at 
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: 
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
> not set: Ignoring message 
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
> HardwareDescription, Time))) sent to 
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
> token is null.
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]  
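
The retries the reporter expects can be sketched generically (illustration 
only, not Flink's actual registration code): treat the fencing-token 
rejection as a transient condition, back off, and re-register, so that only a 
final failure is worth an ERROR log entry.

{code:java}
import java.util.concurrent.Callable;

public class RetryingRegistration {

    // Hypothetical check: the ResourceManager has not been granted
    // leadership yet, so its fencing token is still null.
    static boolean isRetriable(Throwable t) {
        return t.getMessage() != null
                && t.getMessage().contains("Fencing token not set");
    }

    // Retry with exponential backoff; rethrow once attempts are exhausted
    // or the failure is not the transient fencing-token case.
    static <T> T callWithBackoff(Callable<T> call, int maxAttempts,
                                 long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !isRetriable(e)) {
                    throw e; // only now is an ERROR log entry warranted
                }
                Thread.sleep(delay);              // transient: back off quietly
                delay = Math.min(delay * 2, 10_000L);
            }
        }
    }
}
{code}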

[jira] [Resolved] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit

2022-01-19 Thread Henrik (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik resolved FLINK-12376.

Resolution: Fixed

Seems a PR has been made; in any case, this is a stale issue.

> GCS runtime exn: Request payload size exceeds the limit
> ---
>
> Key: FLINK-12376
> URL: https://issues.apache.org/jira/browse/FLINK-12376
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Google Cloud PubSub
>Affects Versions: 1.7.2
> Environment: FROM flink:1.8.0-scala_2.11
> ARG version=0.17
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib
> COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar
>Reporter: Henrik
>Priority: Not a Priority
>  Labels: auto-deprioritized-major, auto-deprioritized-minor, 
> auto-unassigned
> Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot 
> 2019-05-08 at 12.41.07.png
>
>
> I'm trying to use the Google Cloud Storage file system, but it would seem 
> that the Flink / GCS client libs are creating too-large requests deep inside 
> the GCS Java client.
> The Java client is added to the lib folder with this command in the Dockerfile 
> (probably 
> [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar]
>  at the time of writing):
>  
> {code:java}
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib{code}
> This is the crash output. Focus lines:
> {code:java}
> java.lang.RuntimeException: Error while confirming checkpoint{code}
> and
> {code:java}
>  Caused by: com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.{code}
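> (For scale: 524288 bytes = 512 × 1024 bytes, i.e. a 512 KiB cap on a single 
> request payload.)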
> Full stacktrace:
>  
> {code:java}
> [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO  
> org.apache.flink.runtime.taskmanager.Task - Source: 
> Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
> (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED.
> [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error 
> while confirming checkpoint
> [analytics-867c867ff6-l622h taskmanager]     at 
> org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.lang.Thread.run(Thread.java:748)
> [analytics-867c867ff6-l622h taskmanager] Caused by: 
> com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
> [analytics-867c867ff6-l622h taskmanager]     at 
> io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
> [analytics-867c867ff6-l622h taskmanager]     at 
> 

[jira] [Commented] (FLINK-12382) HA + ResourceManager exception: Fencing token not set

2022-01-19 Thread Henrik (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478562#comment-17478562
 ] 

Henrik commented on FLINK-12382:


Do you support running Flink without ZooKeeper now, e.g. with k8s leader 
election?
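
(Context: newer Flink releases ship Kubernetes-native HA services. A minimal 
flink-conf.yaml sketch, assuming Flink 1.12+, with placeholder values for the 
cluster id and storage dir:)

{code:java}
kubernetes.cluster-id: analytics-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: gs://example_analytics/flink/ha/
{code}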

> HA + ResourceManager exception: Fencing token not set
> -
>
> Key: FLINK-12382
> URL: https://issues.apache.org/jira/browse/FLINK-12382
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as all previous bugs filed by myself, today, but 
> this time with HA via zetcd.
>Reporter: Henrik
>Priority: Major
> Attachments: jobmanager_log, jobmanager_stdout
>
>
> I'm testing zetcd + session jobs in k8s, checking what happens when I kill 
> both the job-cluster and task-manager at the same time while keeping 
> ZK/zetcd up and running.
> Then I get this stack trace, which is completely non-actionable for me and 
> also resolves itself. I expect a number of retries, and if this exception is 
> part of the protocol's retry signalling, it should not be printed as a log 
> entry.
> This might be related to an older bug: 
> [https://jira.apache.org/jira/browse/FLINK-7734]
> {code:java}
> [tm] 2019-05-01 11:32:01,641 ERROR 
> org.apache.flink.runtime.taskexecutor.TaskExecutor    - Registration 
> at ResourceManager failed due to an error
> [tm] java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
> not set: Ignoring message 
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
> HardwareDescription, Time))) sent to 
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
> token is null.
> [tm]     at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> [tm]     at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> [tm]     at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> [tm]     at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> [tm]     at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [tm]     at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> [tm]     at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> [tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> [tm]     at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> [tm]     at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> [tm]     at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> [tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> [tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
> [tm]     at 
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: 
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
> not set: Ignoring message 
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
> HardwareDescription, Time))) sent to 
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
> token is null.
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [tm]     at 
> 

[jira] [Commented] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit

2021-06-08 Thread Henrik (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359126#comment-17359126
 ] 

Henrik commented on FLINK-12376:


It's messed up that you let this linger for two years before acting on it, 
and that when you act on it, you let a bot do it.

> GCS runtime exn: Request payload size exceeds the limit
> ---
>
> Key: FLINK-12376
> URL: https://issues.apache.org/jira/browse/FLINK-12376
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Google Cloud PubSub
>Affects Versions: 1.7.2
> Environment: FROM flink:1.8.0-scala_2.11
> ARG version=0.17
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib
> COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar
>Reporter: Henrik
>Priority: Major
>  Labels: auto-unassigned, stale-major
> Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot 
> 2019-05-08 at 12.41.07.png
>
>
> I'm trying to use the Google Cloud Storage file system, but it would seem 
> that the Flink / GCS client libs are creating too-large requests deep inside 
> the GCS Java client.
> The Java client is added to the lib folder with this command in the Dockerfile 
> (probably 
> [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar]
>  at the time of writing):
>  
> {code:java}
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib{code}
> This is the crash output. Focus lines:
> {code:java}
> java.lang.RuntimeException: Error while confirming checkpoint{code}
> and
> {code:java}
>  Caused by: com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.{code}
> Full stacktrace:
>  
> {code:java}
> [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO  
> org.apache.flink.runtime.taskmanager.Task - Source: 
> Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
> (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED.
> [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error 
> while confirming checkpoint
> [analytics-867c867ff6-l622h taskmanager]     at 
> org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.lang.Thread.run(Thread.java:748)
> [analytics-867c867ff6-l622h taskmanager] Caused by: 
> com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
> [analytics-867c867ff6-l622h taskmanager]     at 
> io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
> [analytics-867c867ff6-l622h taskmanager]     at 
> 

[jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

2019-05-17 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842012#comment-16842012
 ] 

Henrik commented on FLINK-12384:


It's misbehaving in all sorts of different ways (filter by my reported bugs), 
but not in any way that I can deduce this to be the cause of. This ticket, 
then, is about removing the warning, as it's not really a warning (but info).

> Rolling the etcd servers causes "Connected to an old server; r-o mode will be 
> unavailable"
> --
>
> Key: FLINK-12384
> URL: https://issues.apache.org/jira/browse/FLINK-12384
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> [tm] 2019-05-01 13:30:53,316 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - 
> Initiating client connection, connectString=analytics-zetcd:2181 
> sessionTimeout=6 
> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
> [tm] 2019-05-01 13:30:53,384 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
> configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it.
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
> socket connection to server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
> configured hostname/address for TaskManager: 10.1.2.173.
> [tm] 2019-05-01 13:30:53,401 ERROR 
> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
> Authentication failed
> [tm] 2019-05-01 13:30:53,418 INFO  
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to 
> start actor system at 10.1.2.173:0
> [tm] 2019-05-01 13:30:53,420 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
> connection established to 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating 
> session
> [tm] 2019-05-01 13:30:53,500 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
> Connected to an old server; r-o mode will be unavailable
> [tm] 2019-05-01 13:30:53,500 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
> establishment complete on server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
> 0xbf06a739001d446, negotiated timeout = 6
> [tm] 2019-05-01 13:30:53,525 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: CONNECTED{code}
> Repro:
> Start an etcd-cluster, with e.g. etcd-operator, with three members. Start 
> zetcd in front. Configure the session cluster to point at zetcd.
> Ensure the job can start successfully.
> Now, kill the etcd pods one by one, letting the quorum re-establish in 
> between, so that the cluster is still OK.
> Now restart the job/tm pods. You'll end up in this no-man's-land.
>  
> —
> Workaround: clean out the etcd cluster and remove all its data; however, this 
> resets all time windows and state, despite it being saved in GCS, so it's 
> a crappy workaround.
>  
> –
>  
> flink-conf.yaml
> {code:java}
> parallelism.default: 1
> rest.address: analytics-job
> jobmanager.rpc.address: analytics-job # = resource manager's address too
> jobmanager.heap.size: 1024m
> jobmanager.rpc.port: 6123
> jobmanager.slot.request.timeout: 3
> resourcemanager.rpc.port: 6123
> high-availability.jobmanager.port: 6123
> blob.server.port: 6124
> queryable-state.server.ports: 6125
> taskmanager.heap.size: 1024m
> taskmanager.numberOfTaskSlots: 1
> web.log.path: /var/lib/log/flink/jobmanager.log
> rest.port: 8081
> rest.bind-address: 0.0.0.0
> web.submit.enable: false
> high-availability: zookeeper
> high-availability.storageDir: gs://example_analytics/flink/zetcd/
> high-availability.zookeeper.quorum: analytics-zetcd:2181
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.client.acl: open
> state.backend: rocksdb
> state.checkpoints.num-retained: 3
> state.checkpoints.dir: gs://example_analytics/flink/checkpoints
> state.savepoints.dir: gs://example_analytics/flink/savepoints
> state.backend.incremental: true
> state.backend.async: true
> fs.hdfs.hadoopconf: /opt/flink/hadoop
> log.file: /var/lib/log/flink/jobmanager.log{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

2019-05-16 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841391#comment-16841391
 ] 

Henrik commented on FLINK-12384:


zetcd refers to this project: 
[https://github.com/helm/charts/tree/master/stable/zetcd/]

I'm not running ZK at all, but etcd.
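
Concretely, a minimal sketch of that setup (flags per zetcd's README; service 
names are placeholders): zetcd speaks the ZooKeeper wire protocol and proxies 
onto etcd, and Flink is pointed at it as if it were an ordinary ZK quorum.

{code:java}
# zetcd bridges ZooKeeper clients onto an etcd cluster
zetcd --zkaddr 0.0.0.0:2181 --endpoints analytics-etcd:2379

# flink-conf.yaml then treats it as a ZooKeeper quorum
high-availability: zookeeper
high-availability.zookeeper.quorum: analytics-zetcd:2181
{code}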

> Rolling the etcd servers causes "Connected to an old server; r-o mode will be 
> unavailable"
> --
>
> Key: FLINK-12384
> URL: https://issues.apache.org/jira/browse/FLINK-12384
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> [tm] 2019-05-01 13:30:53,316 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - 
> Initiating client connection, connectString=analytics-zetcd:2181 
> sessionTimeout=6 
> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
> [tm] 2019-05-01 13:30:53,384 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
> configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it.
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
> socket connection to server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
> configured hostname/address for TaskManager: 10.1.2.173.
> [tm] 2019-05-01 13:30:53,401 ERROR 
> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
> Authentication failed
> [tm] 2019-05-01 13:30:53,418 INFO  
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to 
> start actor system at 10.1.2.173:0
> [tm] 2019-05-01 13:30:53,420 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
> connection established to 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating 
> session
> [tm] 2019-05-01 13:30:53,500 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
> Connected to an old server; r-o mode will be unavailable
> [tm] 2019-05-01 13:30:53,500 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
> establishment complete on server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
> 0xbf06a739001d446, negotiated timeout = 6
> [tm] 2019-05-01 13:30:53,525 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: CONNECTED{code}
> Repro:
> Start an etcd-cluster, with e.g. etcd-operator, with three members. Start 
> zetcd in front. Configure the session cluster to point at zetcd.
> Ensure the job can start successfully.
> Now, kill the etcd pods one by one, letting the quorum re-establish in 
> between, so that the cluster is still OK.
> Now restart the job/tm pods. You'll end up in this no-man's-land.
>  
> —
> Workaround: clean out the etcd cluster and remove all its data; however, this 
> resets all time windows and state, despite it being saved in GCS, so it's 
> a crappy workaround.
>  
> –
>  
> flink-conf.yaml
> {code:java}
> parallelism.default: 1
> rest.address: analytics-job
> jobmanager.rpc.address: analytics-job # = resource manager's address too
> jobmanager.heap.size: 1024m
> jobmanager.rpc.port: 6123
> jobmanager.slot.request.timeout: 3
> resourcemanager.rpc.port: 6123
> high-availability.jobmanager.port: 6123
> blob.server.port: 6124
> queryable-state.server.ports: 6125
> taskmanager.heap.size: 1024m
> taskmanager.numberOfTaskSlots: 1
> web.log.path: /var/lib/log/flink/jobmanager.log
> rest.port: 8081
> rest.bind-address: 0.0.0.0
> web.submit.enable: false
> high-availability: zookeeper
> high-availability.storageDir: gs://example_analytics/flink/zetcd/
> high-availability.zookeeper.quorum: analytics-zetcd:2181
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.client.acl: open
> state.backend: rocksdb
> state.checkpoints.num-retained: 3
> state.checkpoints.dir: gs://example_analytics/flink/checkpoints
> state.savepoints.dir: gs://example_analytics/flink/savepoints
> state.backend.incremental: true
> state.backend.async: true
> fs.hdfs.hadoopconf: /opt/flink/hadoop
> log.file: /var/lib/log/flink/jobmanager.log{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

2019-05-16 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12384:
---
Description: 
{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
configured hostname/address for TaskManager: 10.1.2.173.
[tm] 2019-05-01 13:30:53,401 ERROR 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
Authentication failed
[tm] 2019-05-01 13:30:53,418 INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start 
actor system at 10.1.2.173:0
[tm] 2019-05-01 13:30:53,420 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
[tm] 2019-05-01 13:30:53,500 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
Connected to an old server; r-o mode will be unavailable
[tm] 2019-05-01 13:30:53,500 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
0xbf06a739001d446, negotiated timeout = 6
[tm] 2019-05-01 13:30:53,525 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: CONNECTED{code}
Repro:

Start an etcd cluster, with e.g. etcd-operator, with three members. Start zetcd 
in front. Configure the session cluster to point at zetcd.

Ensure the job can start successfully.

Now, kill the etcd pods one by one, letting the quorum re-establish in between, 
so that the cluster is still OK.

Now restart the job/tm pods. You'll end up in this no-man's-land.
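
In kubectl terms, the roll looks roughly like this (pod and label names are 
placeholders for whatever etcd-operator and the Flink deployment create):

{code:java}
# kill the etcd members one at a time, waiting for quorum to recover
kubectl delete pod analytics-etcd-0000        # repeat for each member
kubectl get pods -l app=etcd -w               # wait for the member to rejoin

# then bounce the Flink job/taskmanager pods
kubectl delete pod -l app=analytics-job
kubectl delete pod -l app=analytics-tm
{code}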

 

—

Workaround: clean out the etcd cluster and remove all its data; however, this 
resets all time windows and state, despite it being saved in GCS, so it's a 
crappy workaround.

 

–

 

flink-conf.yaml
{code:java}
parallelism.default: 1
rest.address: analytics-job
jobmanager.rpc.address: analytics-job # = resource manager's address too
jobmanager.heap.size: 1024m
jobmanager.rpc.port: 6123
jobmanager.slot.request.timeout: 3
resourcemanager.rpc.port: 6123
high-availability.jobmanager.port: 6123
blob.server.port: 6124
queryable-state.server.ports: 6125
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 1
web.log.path: /var/lib/log/flink/jobmanager.log
rest.port: 8081
rest.bind-address: 0.0.0.0
web.submit.enable: false
high-availability: zookeeper
high-availability.storageDir: gs://example_analytics/flink/zetcd/
high-availability.zookeeper.quorum: analytics-zetcd:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.client.acl: open
state.backend: rocksdb
state.checkpoints.num-retained: 3
state.checkpoints.dir: gs://example_analytics/flink/checkpoints
state.savepoints.dir: gs://example_analytics/flink/savepoints
state.backend.incremental: true
state.backend.async: true
fs.hdfs.hadoopconf: /opt/flink/hadoop
log.file: /var/lib/log/flink/jobmanager.log{code}

  was:
{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
[tm] 2019-05-01 13:30:53,395 

[jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

2019-05-16 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841133#comment-16841133
 ] 

Henrik commented on FLINK-12384:


Thanks for the ping [~gjy]

I've updated the issue with that information.

> Rolling the etcd servers causes "Connected to an old server; r-o mode will be 
> unavailable"
> --
>
> Key: FLINK-12384
> URL: https://issues.apache.org/jira/browse/FLINK-12384
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> [tm] 2019-05-01 13:30:53,316 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - 
> Initiating client connection, connectString=analytics-zetcd:2181 
> sessionTimeout=6 
> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
> [tm] 2019-05-01 13:30:53,384 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
> configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue 
> connection to Zookeeper server without SASL authentication, if Zookeeper 
> server allows it.
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
> socket connection to server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
> [tm] 2019-05-01 13:30:53,395 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
> configured hostname/address for TaskManager: 10.1.2.173.
> [tm] 2019-05-01 13:30:53,401 ERROR 
> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
> Authentication failed
> [tm] 2019-05-01 13:30:53,418 INFO  
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to 
> start actor system at 10.1.2.173:0
> [tm] 2019-05-01 13:30:53,420 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
> connection established to 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating 
> session
> [tm] 2019-05-01 13:30:53,500 WARN  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
> Connected to an old server; r-o mode will be unavailable
> [tm] 2019-05-01 13:30:53,500 INFO  
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
> establishment complete on server 
> analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
> 0xbf06a739001d446, negotiated timeout = 6
> [tm] 2019-05-01 13:30:53,525 INFO  
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>   - State change: CONNECTED{code}
> Repro:
> Start an etcd-cluster, with e.g. etcd-operator, with three members. Start 
> zetcd in front. Configure the session cluster to point at zetcd.
> Ensure the job can start successfully.
> Now, kill the etcd pods one by one, letting the quorum re-establish in 
> between, so that the cluster is still OK.
> Now restart the job/tm pods. You'll end up in this no-man's-land.
>  
> —
> Workaround: clean out the etcd cluster and remove all its data; however, this 
> resets all time windows and state, despite it being saved in GCS, so it's 
> a crappy workaround.
>  
> --
>  
> flink-conf.yaml
> {{ parallelism.default: 1}}
> {{ rest.address: analytics-job}}
> {{ jobmanager.rpc.address: analytics-job # = resource manager's address too}}
> {{ jobmanager.heap.size: 1024m}}
> {{ jobmanager.rpc.port: 6123}}
> {{ jobmanager.slot.request.timeout: 3}}
> {{ resourcemanager.rpc.port: 6123}}
> {{ high-availability.jobmanager.port: 6123}}
> {{ blob.server.port: 6124}}
> {{ queryable-state.server.ports: 6125}}
> {{ taskmanager.heap.size: 1024m}}
> {{ taskmanager.numberOfTaskSlots: 1}}
> {{ web.log.path: /var/lib/log/flink/jobmanager.log}}
> {{ rest.port: 8081}}
> {{ rest.bind-address: 0.0.0.0}}
> {{ web.submit.enable: false}}
> {{ high-availability: zookeeper}}
> {{ high-availability.storageDir: 
> gs://project-id-example_analytics/flink/zetcd/}}
> {{ high-availability.zookeeper.quorum: analytics-zetcd:2181}}
> {{ high-availability.zookeeper.path.root: /flink}}
> {{ high-availability.zookeeper.client.acl: open}}
> {{ state.backend: rocksdb}}
> {{ state.checkpoints.num-retained: 3}}
> {{ state.checkpoints.dir: 
> gs://project-id-example_analytics/flink/checkpoints}}
> {{ state.savepoints.dir: 
> gs://project-id-example_analytics/flink/savepoints}}
> {{ state.backend.incremental: true}}
> {{ state.backend.async: true}}
> {{ fs.hdfs.hadoopconf: /opt/flink/hadoop}}
> {{ log.file: /var/lib/log/flink/jobmanager.log}}



--
This message was sent by 

[jira] [Updated] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

2019-05-16 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12384:
---
Description: 
{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
configured hostname/address for TaskManager: 10.1.2.173.
[tm] 2019-05-01 13:30:53,401 ERROR 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
Authentication failed
[tm] 2019-05-01 13:30:53,418 INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start 
actor system at 10.1.2.173:0
[tm] 2019-05-01 13:30:53,420 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
[tm] 2019-05-01 13:30:53,500 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
Connected to an old server; r-o mode will be unavailable
[tm] 2019-05-01 13:30:53,500 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
0xbf06a739001d446, negotiated timeout = 6
[tm] 2019-05-01 13:30:53,525 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: CONNECTED{code}
Repro:

Start an etcd cluster, with e.g. etcd-operator, with three members. Start zetcd 
in front. Configure the session cluster to point at zetcd.

Ensure the job can start successfully.

Now, kill the etcd pods one by one, letting the quorum re-establish in between, 
so that the cluster is still OK.

Now restart the job/tm pods. You'll end up in this no-man's-land.

 

—

Workaround: clean out the etcd cluster and remove all its data; however, this 
resets all time windows and state, despite it being saved in GCS, so it's a 
crappy workaround.

 

--

 

flink-conf.yaml

{{ parallelism.default: 1}}
{{ rest.address: analytics-job}}
{{ jobmanager.rpc.address: analytics-job # = resource manager's address too}}
{{ jobmanager.heap.size: 1024m}}
{{ jobmanager.rpc.port: 6123}}
{{ jobmanager.slot.request.timeout: 3}}
{{ resourcemanager.rpc.port: 6123}}
{{ high-availability.jobmanager.port: 6123}}
{{ blob.server.port: 6124}}
{{ queryable-state.server.ports: 6125}}
{{ taskmanager.heap.size: 1024m}}
{{ taskmanager.numberOfTaskSlots: 1}}
{{ web.log.path: /var/lib/log/flink/jobmanager.log}}
{{ rest.port: 8081}}
{{ rest.bind-address: 0.0.0.0}}
{{ web.submit.enable: false}}
{{ high-availability: zookeeper}}
{{ high-availability.storageDir: 
gs://project-id-example_analytics/flink/zetcd/}}
{{ high-availability.zookeeper.quorum: analytics-zetcd:2181}}
{{ high-availability.zookeeper.path.root: /flink}}
{{ high-availability.zookeeper.client.acl: open}}
{{ state.backend: rocksdb}}
{{ state.checkpoints.num-retained: 3}}
{{ state.checkpoints.dir: gs://project-id-example_analytics/flink/checkpoints}}
{{ state.savepoints.dir: 
gs://project-id-example_analytics/flink/savepoints}}
{{ state.backend.incremental: true}}
{{ state.backend.async: true}}
{{ fs.hdfs.hadoopconf: /opt/flink/hadoop}}
{{ log.file: /var/lib/log/flink/jobmanager.log}}

  was:
{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  

[jira] [Comment Edited] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835800#comment-16835800
 ] 

Henrik edited comment on FLINK-12381 at 5/8/19 6:34 PM:


Yes, you can see it like that (a new cluster), I suppose.

So does that mean that Flink is useless without HA, then? Because if I don't 
have HA, and the node I'm running it on, or the k8s pod I'm running it in, 
restarts, it's a new cluster?

In an ideal world, I would not have to manually change the specification of 
the job unless the job itself has changed. That is, having to change the jobid 
whenever the pod is restarted goes against declarative management of resources 
in a k8s cluster.
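
(One way to keep the spec declarative, assuming a Flink version whose 
standalone job entrypoint accepts an explicit --job-id flag, is to pin the id 
in the pod spec instead of editing it per restart; the class name and id below 
are placeholders:)

{code:java}
standalone-job.sh start-foreground \
  --job-classname com.example.analytics.Main \
  --job-id 00000000000000000000000000000000
{code}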


was (Author: haf):
Yes, you can see it like that (a new cluster), I suppose.

So does that mean that flink is useless without HA then? Because if I don't 
have HA, and the node I'm running it on, or the k8s pod I'm running it in, 
restarts, it's a new cluster?

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835505#comment-16835505
 ] 

Henrik edited comment on FLINK-12379 at 5/8/19 6:29 PM:


Yes, it's the "standalone-job.sh" entrypoint. [~StephanEwen]

AFAIK it's the recommended way to deploy a standalone job?
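
Reading the stack trace in the description: the checkpoint path is laid out as 
{{<state.checkpoints.dir>/<job-id>/chk-<n>/_metadata}}, and the double slash in 
{{gs://example_bucket/flink/checkpoints//chk-5/_metadata}} suggests an empty 
job-id component, so every run writes chk-1, chk-2, ... into the same directory 
and collides with files left by the previous run:

{code:java}
gs://example_bucket/flink/checkpoints/<job-id>/chk-5/_metadata   // expected layout
gs://example_bucket/flink/checkpoints//chk-5/_metadata           // observed (empty job id)
{code}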


was (Author: haf):
Yes, it's the "standalone-job.sh" entrypoint. [~StephanEwen]

> Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
> 
>
> Key: FLINK-12379
> URL: https://issues.apache.org/jira/browse/FLINK-12379
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: GCS +
>  
> {code:java}
> <flink.version>1.8.0</flink.version>
> <hadoop.version>1.8</hadoop.version>
> <scala.binary.version>2.11</scala.binary.version>{code}
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>com.google.cloud.bigdataoss</groupId>
>     <artifactId>gcs-connector</artifactId>
>     <version>hadoop2-1.9.16</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-connector-filesystem_2.11</artifactId>
>     <version>${flink.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-hadoop-fs</artifactId>
>     <version>${flink.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-shaded-hadoop2</artifactId>
>     <version>${hadoop.version}-${flink.version}</version>
>   </dependency>
> </dependencies>
> {code}
>  
>  
>Reporter: Henrik
>Priority: Major
>
> When running one standalone-job w/ parallelism=1 + one taskmanager, you will 
> shortly get this crash
> {code:java}
> 2019-04-30 22:20:02,928 WARN  org.apache.flink.runtime.jobmaster.JobMaster
>   - Error while processing checkpoint acknowledgement message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize 
> the pending checkpoint 5.
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-5/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> Caused by: java.nio.file.FileAlreadyExistsException: Object 
> gs://example_bucket/flink/checkpoints//chk-5/_metadata
>  already exists.
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82)
>     ... 19 more
> 2019-04-30 22:20:03,114 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 6 @ 

[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835800#comment-16835800
 ] 

Henrik commented on FLINK-12381:


Yes, you can see it like that (a new cluster), I suppose.

So does that mean that flink is useless without HA then? Because if I don't 
have HA, and the node I'm running it on, or the k8s pod I'm running it in, 
restarts, it's a new cluster?

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835503#comment-16835503
 ] 

Henrik commented on FLINK-12376:


> Either [~haf] used a version of the connector that did not have this fix 
> (could you confirm [~haf]?)

I'm using the latest version: the one you pinged me about, saying you had a 
fix for some exceptions in it (the one you said would work with 1.7.x).

This is the code that was running when this issue occurred.

!Screenshot 2019-05-08 at 12.41.07.png!

> GCS runtime exn: Request payload size exceeds the limit
> ---
>
> Key: FLINK-12376
> URL: https://issues.apache.org/jira/browse/FLINK-12376
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Google Cloud PubSub
>Affects Versions: 1.7.2
> Environment: FROM flink:1.8.0-scala_2.11
> ARG version=0.17
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib
> COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar
>Reporter: Henrik
>Assignee: Richard Deurwaarder
>Priority: Major
> Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot 
> 2019-05-08 at 12.41.07.png
>
>
> I'm trying to use the Google Cloud Storage file system, but it would seem 
> that the Flink / GCS client libs are creating too-large requests deep inside 
> the GCS Java client.
> The Java client is added to the lib folder with this command in the Dockerfile 
> (probably 
> [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar]
>  at the time of writing):
>  
> {code:java}
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib{code}
> This is the crash output. Focus lines:
> {code:java}
> java.lang.RuntimeException: Error while confirming checkpoint{code}
> and
> {code:java}
>  Caused by: com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.{code}
> Full stacktrace:
>  
> {code:java}
> [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO  
> org.apache.flink.runtime.taskmanager.Task - Source: 
> Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
> (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED.
> [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error 
> while confirming checkpoint
> [analytics-867c867ff6-l622h taskmanager]     at 
> org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.lang.Thread.run(Thread.java:748)
> [analytics-867c867ff6-l622h taskmanager] Caused by: 
> com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
> [analytics-867c867ff6-l622h taskmanager]     at 
> 

[jira] [Comment Edited] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835505#comment-16835505
 ] 

Henrik edited comment on FLINK-12379 at 5/8/19 10:43 AM:
-

Yes, it's the "standalone-job.sh" entrypoint. [~StephanEwen]


was (Author: haf):
Yes, it's the "standalone-job.sh" entrypoint.

> Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
> 
>
> Key: FLINK-12379
> URL: https://issues.apache.org/jira/browse/FLINK-12379
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: GCS +
>  
> {code:java}
> <flink.version>1.8.0</flink.version>
> <java.version>1.8</java.version>
> <scala.binary.version>2.11</scala.binary.version>{code}
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>com.google.cloud.bigdataoss</groupId>
>     <artifactId>gcs-connector</artifactId>
>     <version>hadoop2-1.9.16</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-connector-filesystem_2.11</artifactId>
>     <version>${flink.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-hadoop-fs</artifactId>
>     <version>${flink.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-shaded-hadoop2</artifactId>
>     <version>${hadoop.version}-${flink.version}</version>
>   </dependency>
> </dependencies>
> {code}
>  
>  
>Reporter: Henrik
>Priority: Major
>
> When running one standalone-job w/ parallelism=1 + one taskmanager, you will 
> shortly get this crash
> {code:java}
> 2019-04-30 22:20:02,928 WARN  org.apache.flink.runtime.jobmaster.JobMaster
>   - Error while processing checkpoint acknowledgement message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize 
> the pending checkpoint 5.
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-5/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> Caused by: java.nio.file.FileAlreadyExistsException: Object 
> gs://example_bucket/flink/checkpoints//chk-5/_metadata
>  already exists.
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82)
>     ... 19 more
> 2019-04-30 22:20:03,114 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 6 @ 1556662802928 for job .{code}
> My guess at why: 

[jira] [Commented] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835505#comment-16835505
 ] 

Henrik commented on FLINK-12379:


Yes, it's the "standalone-job.sh" entrypoint.

> Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
> 
>
> Key: FLINK-12379
> URL: https://issues.apache.org/jira/browse/FLINK-12379
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: GCS +
>  
> {code:java}
> <flink.version>1.8.0</flink.version>
> <java.version>1.8</java.version>
> <scala.binary.version>2.11</scala.binary.version>{code}
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>com.google.cloud.bigdataoss</groupId>
>     <artifactId>gcs-connector</artifactId>
>     <version>hadoop2-1.9.16</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-connector-filesystem_2.11</artifactId>
>     <version>${flink.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-hadoop-fs</artifactId>
>     <version>${flink.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-shaded-hadoop2</artifactId>
>     <version>${hadoop.version}-${flink.version}</version>
>   </dependency>
> </dependencies>
> {code}
>  
>  
>Reporter: Henrik
>Priority: Major
>
> When running one standalone-job w/ parallelism=1 + one taskmanager, you will 
> shortly get this crash
> {code:java}
> 2019-04-30 22:20:02,928 WARN  org.apache.flink.runtime.jobmaster.JobMaster
>   - Error while processing checkpoint acknowledgement message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize 
> the pending checkpoint 5.
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-5/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> Caused by: java.nio.file.FileAlreadyExistsException: Object 
> gs://example_bucket/flink/checkpoints//chk-5/_metadata
>  already exists.
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286)
>     at 
> com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82)
>     ... 19 more
> 2019-04-30 22:20:03,114 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 6 @ 1556662802928 for job .{code}
> My guess at why: concurrent checkpoint writers are updating the _metadata 
> resource at the same time. They should use optimistic concurrency 

[jira] [Updated] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit

2019-05-08 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12376:
---
Attachment: Screenshot 2019-05-08 at 12.41.07.png

> GCS runtime exn: Request payload size exceeds the limit
> ---
>
> Key: FLINK-12376
> URL: https://issues.apache.org/jira/browse/FLINK-12376
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Google Cloud PubSub
>Affects Versions: 1.7.2
> Environment: FROM flink:1.8.0-scala_2.11
> ARG version=0.17
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib
> COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar
>Reporter: Henrik
>Assignee: Richard Deurwaarder
>Priority: Major
> Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot 
> 2019-05-08 at 12.41.07.png
>
>
> I'm trying to use the google cloud storage file system, but it would seem 
> that the FLINK / GCS client libs are creating too-large requests far down in 
> the GCS Java client.
> The Java client is added to the lib folder with this command in Dockerfile 
> (probably 
> [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar]
>  at the time of writing):
>  
> {code:java}
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib{code}
> This is the crash output. Focus lines:
> {code:java}
> java.lang.RuntimeException: Error while confirming checkpoint{code}
> and
> {code:java}
>  Caused by: com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.{code}
> Full stacktrace:
>  
> {code:java}
> [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO  
> org.apache.flink.runtime.taskmanager.Task - Source: 
> Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
> (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED.
> [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error 
> while confirming checkpoint
> [analytics-867c867ff6-l622h taskmanager]     at 
> org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.lang.Thread.run(Thread.java:748)
> [analytics-867c867ff6-l622h taskmanager] Caused by: 
> com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
> [analytics-867c867ff6-l622h taskmanager]     at 
> io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
> [analytics-867c867ff6-l622h taskmanager]     at 
> io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
> [analytics-867c867ff6-l622h taskmanager]     at 
> 

[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-08 Thread Henrik (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835507#comment-16835507
 ] 

Henrik commented on FLINK-12381:


It's just a single "cluster", so I don't think what you write above applies. 
There's only a single app/job. So I don't see how setting a job id helps.

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Coordination
>Affects Versions: 1.8.0
> Environment: Same as FLINK-{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.
>  
> Repro:
>  # Set up a single purpose job cluster (which could use much better docs btw!)
>  # Let it run with GCS checkpointing for a while with rocksdb/gs://example
>  # Kill it
>  # Start it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

2019-05-01 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12384:
---
Description: 
{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
configured hostname/address for TaskManager: 10.1.2.173.
[tm] 2019-05-01 13:30:53,401 ERROR 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
Authentication failed
[tm] 2019-05-01 13:30:53,418 INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start 
actor system at 10.1.2.173:0
[tm] 2019-05-01 13:30:53,420 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
[tm] 2019-05-01 13:30:53,500 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
Connected to an old server; r-o mode will be unavailable
[tm] 2019-05-01 13:30:53,500 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
0xbf06a739001d446, negotiated timeout = 6
[tm] 2019-05-01 13:30:53,525 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: CONNECTED{code}
Repro:

Start an etcd cluster with three members, e.g. via etcd-operator. Start zetcd 
in front. Configure the session cluster to point at zetcd.

Ensure the job can start successfully.

Now, kill the etcd pods one by one, letting the quorum re-establish in between, 
so that the cluster is still OK.

Now restart the job/tm pods. You'll end up in this no-man's-land.

 

---

Workaround: clean out the etcd cluster and remove all its data. However, this 
resets all time windows and state, despite those being saved in GCS, so it's a 
crappy workaround.

  was:
{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
configured hostname/address for TaskManager: 10.1.2.173.
[tm] 2019-05-01 13:30:53,401 ERROR 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
Authentication failed
[tm] 2019-05-01 13:30:53,418 INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start 
actor system at 10.1.2.173:0
[tm] 2019-05-01 13:30:53,420 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
[tm] 2019-05-01 13:30:53,500 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
Connected to an old server; r-o mode will be unavailable
[tm] 2019-05-01 13:30:53,500 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
0xbf06a739001d446, negotiated timeout = 6
[tm] 2019-05-01 13:30:53,525 INFO  

[jira] [Created] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"

2019-05-01 Thread Henrik (JIRA)
Henrik created FLINK-12384:
--

 Summary: Rolling the etcd servers causes "Connected to an old 
server; r-o mode will be unavailable"
 Key: FLINK-12384
 URL: https://issues.apache.org/jira/browse/FLINK-12384
 Project: Flink
  Issue Type: Bug
Reporter: Henrik


{code:java}
[tm] 2019-05-01 13:30:53,316 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Initiating 
client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f
[tm] 2019-05-01 13:30:53,384 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL 
configuration failed: javax.security.auth.login.LoginException: No JAAS 
configuration section named 'Client' was found in specified JAAS configuration 
file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181
[tm] 2019-05-01 13:30:53,395 INFO  
org.apache.flink.runtime.taskexecutor.TaskManagerRunner   - Using 
configured hostname/address for TaskManager: 10.1.2.173.
[tm] 2019-05-01 13:30:53,401 ERROR 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - 
Authentication failed
[tm] 2019-05-01 13:30:53,418 INFO  
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start 
actor system at 10.1.2.173:0
[tm] 2019-05-01 13:30:53,420 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session
[tm] 2019-05-01 13:30:53,500 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket  - 
Connected to an old server; r-o mode will be unavailable
[tm] 2019-05-01 13:30:53,500 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 
0xbf06a739001d446, negotiated timeout = 6
[tm] 2019-05-01 13:30:53,525 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: CONNECTED{code}
Repro:

Start an etcd cluster with three members, e.g. via etcd-operator. Start zetcd 
in front. Configure the session cluster to point at zetcd.

Ensure the job can start successfully.

Now, kill the etcd pods one by one, letting the quorum re-establish in between, 
so that the cluster is still OK.

Now restart the job/tm pods. You'll end up in this no-man's-land.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12383) "Log file environment variable 'log.file' is not set" despite web.log.path being set

2019-05-01 Thread Henrik (JIRA)
Henrik created FLINK-12383:
--

 Summary: "Log file environment variable 'log.file' is not set" 
despite web.log.path being set
 Key: FLINK-12383
 URL: https://issues.apache.org/jira/browse/FLINK-12383
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Affects Versions: 1.8.0
Reporter: Henrik


You get these warnings when starting a session cluster, despite having 
configured everything log-related as specified by the configuration reference 
on the [web 
site|https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#web-frontend]:
{code:java}
[job] 2019-05-01 13:25:35,418 WARN  
org.apache.flink.runtime.webmonitor.WebMonitorUtils   - Log file 
environment variable 'log.file' is not set.
[job] 2019-05-01 13:25:35,419 INFO  
org.apache.flink.runtime.webmonitor.WebMonitorUtils   - Determined 
location of main cluster component log file: /var/lib/log/flink/jobmanager.log
[job] 2019-05-01 13:25:35,419 INFO  
org.apache.flink.runtime.webmonitor.WebMonitorUtils   - Determined 
location of main cluster component stdout file: 
/var/lib/log/flink/jobmanager.out
{code}
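
For what it's worth, the warning appears to be keyed off the 'log.file' JVM 
system property rather than web.log.path, so a possible workaround is to pass 
the property explicitly. A sketch, assuming env.java.opts.jobmanager is 
respected by the session-cluster entrypoint (the path reuses the one already 
shown in the log lines above):

{code:java}
# flink-conf.yaml (sketch; assumes env.java.opts.jobmanager reaches the
# jobmanager JVM, and reuses the log path from the warnings above)
env.java.opts.jobmanager: -Dlog.file=/var/lib/log/flink/jobmanager.log
{code}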



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12382) HA + ResourceManager exception: Fencing token not set

2019-05-01 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12382:
---
Description: 
I'm testing zetcd + session jobs in k8s, and testing what happens when I kill 
both the job-cluster and task-manager at the same time, but maintain ZK/zetcd 
up and running.

Then I get this stacktrace, which is completely non-actionable for me and also 
resolves itself. I expect a number of retries, and if this exception is part of 
the protocol's signalling to retry, then it should not be printed as a log entry.
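
To make the expectation concrete, here is a minimal sketch (not Flink's actual 
code; the retry scheduling is left abstract) of treating the unset fencing 
token as a retryable condition that is logged below ERROR:

{code:java}
import java.util.concurrent.CompletableFuture;
import org.apache.flink.runtime.rpc.exceptions.FencingTokenException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal sketch, not Flink's actual code: retry() is a hypothetical stand-in
// for the TaskExecutor's registration retry machinery.
class RegistrationSketch {
    private static final Logger LOG = LoggerFactory.getLogger(RegistrationSketch.class);

    void onRegistrationOutcome(CompletableFuture<Void> registrationFuture) {
        registrationFuture.whenComplete((ack, failure) -> {
            if (failure == null) {
                LOG.info("Registered at ResourceManager.");
            } else if (failure instanceof FencingTokenException
                    || failure.getCause() instanceof FencingTokenException) {
                // The ResourceManager has not been granted leadership yet; this
                // is part of the retry protocol, so log quietly, not at ERROR.
                LOG.debug("ResourceManager leader not ready (fencing token unset); retrying.");
                retry();
            } else {
                LOG.error("Registration at ResourceManager failed.", failure);
            }
        });
    }

    void retry() { /* hypothetical: schedule the next registration attempt */ }
}
{code}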

This might be related to an older bug: 
[https://jira.apache.org/jira/browse/FLINK-7734]
{code:java}
[tm] 2019-05-01 11:32:01,641 ERROR 
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Registration at 
ResourceManager failed due to an error
[tm] java.util.concurrent.CompletionException: 
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
HardwareDescription, Time))) sent to 
akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
token is null.
[tm]     at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
[tm]     at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
[tm]     at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
[tm]     at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
[tm]     at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
[tm]     at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
[tm]     at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
[tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
[tm]     at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
[tm]     at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
[tm]     at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
[tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
[tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
[tm]     at 
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: 
Fencing token not set: Ignoring message 
RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
HardwareDescription, Time))) sent to 
akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
token is null.
[tm]     at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
[tm]     at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
[tm]     at 
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     ... 9 more
[tm] 2019-05-01 11:32:01,650 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Pausing and 
re-attempting registration in 1 ms
[tm] 2019-05-01 11:32:03,070 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor    - The heartbeat 
of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm] 2019-05-01 11:32:03,070 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Close 

[jira] [Updated] (FLINK-12382) HA + ResourceManager exception: Fencing token not set

2019-05-01 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12382:
---
Description: 
I'm testing zetcd + session jobs in k8s, and testing what happens when I kill 
both the job-cluster and task-manager at the same time, but maintain ZK/zetcd 
up and running.

Then I get this stacktrace, which is completely non-actionable for me and also 
resolves itself. I expect a number of retries, and if this exception is part of 
the protocol's signalling to retry, then it should not be printed as a log entry.

This might be related to an older bug: 
[https://jira.apache.org/jira/browse/FLINK-7734]
{code:java}
[tm] 2019-05-01 11:32:01,641 ERROR 
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Registration at 
ResourceManager failed due to an error
[tm] java.util.concurrent.CompletionException: 
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
HardwareDescription, Time))) sent to 
akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
token is null.
[tm]     at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
[tm]     at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
[tm]     at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
[tm]     at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
[tm]     at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
[tm]     at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
[tm]     at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
[tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
[tm]     at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
[tm]     at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
[tm]     at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
[tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
[tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
[tm]     at 
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: 
Fencing token not set: Ignoring message 
RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
HardwareDescription, Time))) sent to 
akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
token is null.
[tm]     at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
[tm]     at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
[tm]     at 
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     ... 9 more
[tm] 2019-05-01 11:32:01,650 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Pausing and 
re-attempting registration in 1 ms
[tm] 2019-05-01 11:32:03,070 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor    - The heartbeat 
of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
[tm] 2019-05-01 11:32:03,070 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Close 

[jira] [Created] (FLINK-12382) HA + ResourceManager exception: Fencing token not set

2019-05-01 Thread Henrik (JIRA)
Henrik created FLINK-12382:
--

 Summary: HA + ResourceManager exception: Fencing token not set
 Key: FLINK-12382
 URL: https://issues.apache.org/jira/browse/FLINK-12382
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.8.0
 Environment: Same as all previous bugs filed by myself today, but 
this time with HA via zetcd.
Reporter: Henrik


I'm testing zetcd + session jobs in k8s, and testing what happens when I kill 
both the job-cluster and task-manager at the same time, but maintain ZK/zetcd 
up and running.

Then I get this stacktrace, which is completely non-actionable for me and also 
resolves itself. I expect a number of retries, and if this exception is part of 
the protocol's signalling to retry, then it should not be printed as a log entry.

This might be related to an older bug: 
https://jira.apache.org/jira/browse/FLINK-7734

 

 
{code:java}
[tm] 2019-05-01 11:32:01,641 ERROR 
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Registration at 
ResourceManager failed due to an error
[tm] java.util.concurrent.CompletionException: 
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
HardwareDescription, Time))) sent to 
akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
token is null.
[tm]     at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
[tm]     at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
[tm]     at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
[tm]     at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
[tm]     at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
[tm]     at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
[tm]     at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
[tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
[tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
[tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
[tm]     at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
[tm]     at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
[tm]     at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
[tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
[tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
[tm]     at 
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
[tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
[tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
[tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
[tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
[tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
[tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[tm]     at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: 
Fencing token not set: Ignoring message 
RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
HardwareDescription, Time))) sent to 
akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
token is null.
[tm]     at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
[tm]     at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
[tm]     at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
[tm]     at 
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
[tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
[tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
[tm]     ... 9 more
[tm] 2019-05-01 11:32:01,650 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor    - Pausing and 
re-attempting 

[jira] [Updated] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-01 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12381:
---
Description: 
{code:java}
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
 already exists
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
    at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
    ... 8 more
{code}
Instead, it should either just overwrite the checkpoint or fail to start the 
job completely. Partial and undefined failure is not what should happen.
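
To make "overwrite or fail" concrete, a minimal sketch against the Hadoop 
FileSystem API (illustrative only, not Flink's code; the overwrite flag on 
FileSystem.create is a real part of that API):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: either overwrite the stale _metadata deliberately, or
// refuse to start the job up front, instead of crashing mid-checkpoint later.
class MetadataWriteSketch {
    static FSDataOutputStream openMetadata(FileSystem fs, Path metadata, boolean allowOverwrite)
            throws IOException {
        if (!allowOverwrite && fs.exists(metadata)) {
            // Fail at job start instead of during checkpoint finalization.
            throw new IOException("Refusing to start: " + metadata + " already exists");
        }
        // FileSystem.create(Path, boolean overwrite) replaces an existing file
        // when overwrite == true.
        return fs.create(metadata, allowOverwrite);
    }
}
{code}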

 

Repro:
 # Set up a single purpose job cluster (which could use much better docs btw!)
 # Let it run with GCS checkpointing for a while with rocksdb/gs://example
 # Kill it
 # Start it

  was:
{code:java}
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
 already exists
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
    at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
    ... 8 more
{code}
Instead, it should either just overwrite the checkpoint or fail to start the 
job completely. Partial and undefined failure is not what should happen.


> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.8.0
> Environment: Same as FLINK-{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> 

[jira] [Updated] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes

2019-05-01 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12381:
---
Summary: W/o HA, upon a full restart, checkpointing crashes  (was: Without 
failover (aka "HA") configured, full restarts' checkpointing crashes)

> W/o HA, upon a full restart, checkpointing crashes
> --
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.8.0
> Environment: Same as FLINK-{12379, 12377, 12376}
>Reporter: Henrik
>Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> 'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
>  already exists
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the 
> job completely. Partial and undefined failure is not what should happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12381) Without failover (aka "HA") configured, full restarts' checkpointing crashes

2019-05-01 Thread Henrik (JIRA)
Henrik created FLINK-12381:
--

 Summary: Without failover (aka "HA") configured, full restarts' 
checkpointing crashes
 Key: FLINK-12381
 URL: https://issues.apache.org/jira/browse/FLINK-12381
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.8.0
 Environment: Same as FLINK-{12379, 12377, 12376}
Reporter: Henrik


{code:java}
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
'gs://example_bucket/flink/checkpoints//chk-16/_metadata'
 already exists
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
    at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
    ... 8 more
{code}
Instead, it should either just overwrite the checkpoint or fail to start the 
job completely. Partial and undefined failure is not what should happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint

2019-04-30 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12379:
---
Description: 
When running one standalone-job w/ parallelism=1 + one taskmanager, you will 
shortly get this crash
{code:java}
2019-04-30 22:20:02,928 WARN  org.apache.flink.runtime.jobmaster.JobMaster  
    - Error while processing checkpoint acknowledgement message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the 
pending checkpoint 5.
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837)
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756)
    at 
org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
'gs://example_bucket/flink/checkpoints//chk-5/_metadata'
 already exists
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
    at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
    ... 8 more
Caused by: java.nio.file.FileAlreadyExistsException: Object 
gs://example_bucket/flink/checkpoints//chk-5/_metadata
 already exists.
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918)
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410)
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286)
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82)
    ... 19 more
2019-04-30 22:20:03,114 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 6 @ 1556662802928 for job .{code}
My guess at why: multiple checkpoint writers are updating the _metadata object 
at the same time. They should use optimistic concurrency control (ETag / 
generation preconditions) on GCS, and retry until successful.
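
To make the suggestion concrete, here is a minimal sketch of such a 
conditional write against the google-cloud-storage Java client directly. This 
is illustrative only: Flink's real code path goes through the Hadoop GCS 
connector, and the bucket and object names below are made up.
{code:java}
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

public class MetadataOccWrite {

    /** Writes {@code bytes} to bucket/name, retrying whenever a precondition fails. */
    static void writeWithPrecondition(Storage storage, String bucket, String name, byte[] bytes) {
        while (true) {
            try {
                Blob existing = storage.get(BlobId.of(bucket, name));
                if (existing == null) {
                    // Only create if no live object exists yet.
                    storage.create(BlobInfo.newBuilder(BlobId.of(bucket, name)).build(), bytes,
                            Storage.BlobTargetOption.doesNotExist());
                } else {
                    // Only overwrite the exact generation we just read.
                    BlobId pinned = BlobId.of(bucket, name, existing.getGeneration());
                    storage.create(BlobInfo.newBuilder(pinned).build(), bytes,
                            Storage.BlobTargetOption.generationMatch());
                }
                return; // success
            } catch (StorageException e) {
                if (e.getCode() != 412) {
                    throw e; // not "Precondition Failed" -> a real error, rethrow
                }
                // 412: another writer won the race; loop to re-read the new generation.
            }
        }
    }

    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        writeWithPrecondition(storage, "example_bucket", "flink/checkpoints/chk-5/_metadata",
                "checkpoint metadata".getBytes(StandardCharsets.UTF_8));
    }
}
{code}
The pattern: read the current generation, write with a precondition pinned to 
it, and treat HTTP 412 (Precondition Failed) as "lost the race, re-read and 
retry".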

When running with parallelism=4, you always seem to get this, even after 
deleting all checkpoints:
{code:java}
[analytics-job-cluster-668886c96b-mhqmc job] 2019-04-30 22:50:35,175 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 
triggering task Source: Custom Source -> Process -> Timestamps/Watermarks -> 
our_events (1/4) of job  is not in state 
RUNNING but SCHEDULED instead. Aborting checkpoint.{code}
Or in short: with parallelism > 1, your Flink job never makes progress.

  was:
When running a standalone job with parallelism=4 and a taskmanager, you will 
shortly get this crash:
{code:java}
2019-04-30 22:20:02,928 WARN  org.apache.flink.runtime.jobmaster.JobMaster  
    - Error while processing checkpoint acknowledgement message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the 
pending checkpoint 5.
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837)
    at 

[jira] [Created] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint

2019-04-30 Thread Henrik (JIRA)
Henrik created FLINK-12379:
--

 Summary: Parallelism in job/GCS/Hadoop: Could not finalize the 
pending checkpoint
 Key: FLINK-12379
 URL: https://issues.apache.org/jira/browse/FLINK-12379
 Project: Flink
  Issue Type: Bug
  Components: FileSystems
Affects Versions: 1.8.0
 Environment: GCS +

 
{code:java}
<flink.version>1.8.0</flink.version>
<java.version>1.8</java.version>
<scala.binary.version>2.11</scala.binary.version>{code}
{code:java}
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <version>hadoop2-1.9.16</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-filesystem_2.11</artifactId>
  <version>${flink.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-fs</artifactId>
  <version>${flink.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-shaded-hadoop2</artifactId>
  <version>${hadoop.version}-${flink.version}</version>
</dependency>
{code}
 

 
Reporter: Henrik


When running a standalone job with parallelism=4 and a taskmanager, you will 
shortly get this crash:
{code:java}
2019-04-30 22:20:02,928 WARN  org.apache.flink.runtime.jobmaster.JobMaster  
    - Error while processing checkpoint acknowledgement message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the 
pending checkpoint 5.
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837)
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756)
    at 
org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
'gs://example_bucket/flink/checkpoints//chk-5/_metadata'
 already exists
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
    at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
    at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
    at 
org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
    at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
    ... 8 more
Caused by: java.nio.file.FileAlreadyExistsException: Object 
gs://example_bucket/flink/checkpoints//chk-5/_metadata
 already exists.
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918)
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410)
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286)
    at 
com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264)
    at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82)
    ... 19 more
2019-04-30 22:20:03,114 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 6 @ 1556662802928 for job .{code}
My guess at why: multiple checkpoint writers are updating the _metadata object 
at the same time. They should use optimistic concurrency control (ETag / 
generation preconditions) on GCS, and retry until successful.





[jira] [Updated] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit

2019-04-30 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12376:
---
Environment: 
FROM flink:1.8.0-scala_2.11
ARG version=0.17
ADD 
https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar 
/opt/flink/lib
COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar

  was:* k8s latest docker-for-desktop on macOS, and scala 2.11-compiled Flink


> GCS runtime exn: Request payload size exceeds the limit
> ---
>
> Key: FLINK-12376
> URL: https://issues.apache.org/jira/browse/FLINK-12376
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.7.2
> Environment: FROM flink:1.8.0-scala_2.11
> ARG version=0.17
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib
> COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar
>Reporter: Henrik
>Priority: Major
> Attachments: Screenshot 2019-04-30 at 22.32.34.png
>
>
> I'm trying to use the Google Cloud Storage file system, but it would seem 
> that the Flink / GCS client libs are creating too-large requests far down in 
> the GCS Java client.
> The Java client is added to the lib folder with this command in Dockerfile 
> (probably 
> [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar]
>  at the time of writing):
>  
> {code:java}
> ADD 
> https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
>  /opt/flink/lib{code}
> This is the crash output. Focus lines:
> {code:java}
> java.lang.RuntimeException: Error while confirming checkpoint{code}
> and
> {code:java}
>  Caused by: com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.{code}
> Full stacktrace:
>  
> {code:java}
> [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO  
> org.apache.flink.runtime.taskmanager.Task - Source: 
> Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
> (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED.
> [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error 
> while confirming checkpoint
> [analytics-867c867ff6-l622h taskmanager]     at 
> org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [analytics-867c867ff6-l622h taskmanager]     at 
> java.lang.Thread.run(Thread.java:748)
> [analytics-867c867ff6-l622h taskmanager] Caused by: 
> com.google.api.gax.rpc.InvalidArgumentException: 
> io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size 
> exceeds the limit: 524288 bytes.
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
> [analytics-867c867ff6-l622h taskmanager]     at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
> [analytics-867c867ff6-l622h taskmanager]     at 
> io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
> [analytics-867c867ff6-l622h 

[jira] [Created] (FLINK-12377) Outdated docs (Flink 0.10) for Google Compute Engine

2019-04-30 Thread Henrik (JIRA)
Henrik created FLINK-12377:
--

 Summary: Outdated docs (Flink 0.10) for Google Compute Engine
 Key: FLINK-12377
 URL: https://issues.apache.org/jira/browse/FLINK-12377
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.0
Reporter: Henrik


[https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/gce_setup.html]
 links to 
[https://github.com/GoogleCloudPlatform/bdutil/blob/master/extensions/flink/flink_env.sh]
 which uses ancient versions of Hadoop and Flink.

Also, a barrier for newcomers is that bdutil itself is deprecated and its 
README recommends Dataflow instead.

Furthermore, perhaps it would be wise to include GCS among the built-in 
filesystems in 
[https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/filesystems.html]

Further, 
[https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/filesystems.html#hdfs-and-hadoop-file-system-support]
 doesn't actually link to the configuration pages for these other Hadoop-based 
filesystems, nor does it explain which exact library needs to be in the Flink 
`lib` folder, so it's really hard to get any further from there.

Lastly, it would seem the 1.8.0 release has stopped shipping the Hadoop libs 
without documenting the required change, e.g. on the Hadoop File System page 
linked above.
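
To illustrate the kind of change that seems to be required (this reuses the 
flink-shaded-hadoop2 coordinates from the FLINK-12379 pom above, and is a 
sketch rather than official guidance): either place the matching shaded-Hadoop 
uber jar in Flink's `lib` folder, or declare the dependency explicitly:
{code:java}
<!-- Flink 1.8.0 no longer bundles Hadoop. One option is to declare the
     shaded Hadoop artifact yourself; its version follows the pattern
     ${hadoop.version}-${flink.version}, e.g. 2.8.3-1.8.0. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-shaded-hadoop2</artifactId>
  <version>${hadoop.version}-${flink.version}</version>
</dependency>
{code}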





[jira] [Updated] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit

2019-04-30 Thread Henrik (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrik updated FLINK-12376:
---
Description: 
I'm trying to use the Google Cloud Storage file system, but it would seem that 
the Flink / GCS client libs are creating too-large requests far down in the GCS 
Java client.

The Java client is added to the lib folder with this command in Dockerfile 
(probably 
[hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar]
 at the time of writing):

 
{code:java}
ADD 
https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar 
/opt/flink/lib{code}
This is the crash output. Focus lines:
{code:java}
java.lang.RuntimeException: Error while confirming checkpoint{code}
and
{code:java}
 Caused by: com.google.api.gax.rpc.InvalidArgumentException: 
io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds 
the limit: 524288 bytes.{code}
Full stacktrace:

 
{code:java}
[analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO  
org.apache.flink.runtime.taskmanager.Task - Source: Custom 
Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
(9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED.
[analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error 
while confirming checkpoint
[analytics-867c867ff6-l622h taskmanager]     at 
org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[analytics-867c867ff6-l622h taskmanager]     at 
java.lang.Thread.run(Thread.java:748)
[analytics-867c867ff6-l622h taskmanager] Caused by: 
com.google.api.gax.rpc.InvalidArgumentException: 
io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds 
the limit: 524288 bytes.
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)

[jira] [Created] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit

2019-04-30 Thread Henrik (JIRA)
Henrik created FLINK-12376:
--

 Summary: GCS runtime exn: Request payload size exceeds the limit
 Key: FLINK-12376
 URL: https://issues.apache.org/jira/browse/FLINK-12376
 Project: Flink
  Issue Type: Bug
  Components: FileSystems
Affects Versions: 1.7.2
 Environment: * k8s latest docker-for-desktop on macOS, and scala 
2.11-compiled Flink
Reporter: Henrik
 Attachments: Screenshot 2019-04-30 at 22.32.34.png

I'm trying to use the Google Cloud Storage file system, but it would seem that 
the Flink / GCS client libs are creating too-large requests far down in the GCS 
Java client.

 
{code:java}
[analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO  
org.apache.flink.runtime.taskmanager.Task - Source: Custom 
Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
(9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED.
[analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error 
while confirming checkpoint
[analytics-867c867ff6-l622h taskmanager]     at 
org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[analytics-867c867ff6-l622h taskmanager]     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[analytics-867c867ff6-l622h taskmanager]     at 
java.lang.Thread.run(Thread.java:748)
[analytics-867c867ff6-l622h taskmanager] Caused by: 
com.google.api.gax.rpc.InvalidArgumentException: 
io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds 
the limit: 524288 bytes.
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
[analytics-867c867ff6-l622h taskmanager]     at 
com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
[analytics-867c867ff6-l622h taskmanager]     at 
io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
[analytics-867c867ff6-l622h taskmanager]     at