[jira] [Commented] (FLINK-12382) HA + ResourceManager exception: Fencing token not set
[ https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478579#comment-17478579 ] Henrik commented on FLINK-12382: Excellent, thank you! > HA + ResourceManager exception: Fencing token not set > - > > Key: FLINK-12382 > URL: https://issues.apache.org/jira/browse/FLINK-12382 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.0 > Environment: Same all all previous bugs filed by myself, today, but > this time with HA with zetcd. >Reporter: Henrik >Priority: Major > Attachments: jobmanager_log, jobmanager_stdout > > > I'm testing zetcd + session jobs in k8s, and testing what happens when I kill > both the job-cluster and task-manager at the same time, but maintain ZK/zetcd > up and running. > Then I get this stacktrace, that's completely non-actionable for me, and also > resolves itself. I expect a number of retries, and if this exception is part > of the protocol signalling to retry, then it should not be printed as a log > entry. > This might be related to an older bug: > [https://jira.apache.org/jira/browse/FLINK-7734] > {code:java} > [tm] 2019-05-01 11:32:01,641 ERROR > org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration > at ResourceManager failed due to an error > [tm] java.util.concurrent.CompletionException: > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token > not set: Ignoring message > RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, > RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, > HardwareDescription, Time))) sent to > akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing > token is null. > [tm] at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > [tm] at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > [tm] at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) > [tm] at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) > [tm] at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > [tm] at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > [tm] at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) > [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) > [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) > [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) > [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) > [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > [tm] at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) > [tm] at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > [tm] at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) > [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) > [tm] at > akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) > [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) > [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) > [tm] at 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) > [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > [tm] at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [tm] at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [tm] Caused by: > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token > not set: Ignoring message > RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, > RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, > HardwareDescription, Time))) sent to > akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing > token is null. > [tm] at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) > [tm]
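For context on why the error above resolves itself: the ResourceManager only receives a fencing token once the HA service (ZooKeeper, or zetcd here) grants it leadership, and every fenced RPC that arrives before that point is rejected. The sketch below is a simplified, self-contained model of that check, not Flink's actual FencedAkkaRpcActor code; the class and method names are illustrative only.

{code:java}
import java.util.UUID;
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of a fenced RPC endpoint (illustrative, not Flink code).
final class FencedEndpoint {

    // null until the HA service (ZooKeeper / zetcd here) grants leadership
    private final AtomicReference<UUID> fencingToken = new AtomicReference<>(null);

    void grantLeadership(UUID newToken) {
        fencingToken.set(newToken);
    }

    void handleMessage(UUID senderToken, Runnable rpcInvocation) {
        UUID expected = fencingToken.get();
        if (expected == null) {
            // The transient state seen in the log: the TaskExecutor tried to register
            // before the ResourceManager finished leader election, so the message is
            // rejected and the registration is expected to be retried.
            throw new IllegalStateException(
                "Fencing token not set: Ignoring message because the fencing token is null.");
        }
        if (!expected.equals(senderToken)) {
            throw new IllegalStateException("Fencing token mismatch: ignoring message.");
        }
        rpcInvocation.run();
    }
}
{code}

Under that reading, the TaskExecutor's registration retries succeed as soon as leadership has been granted and the token is set, which matches the "resolves itself" behaviour reported above.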
[jira] [Resolved] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit
[ https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik resolved FLINK-12376. Resolution: Fixed Seems a PR has been made; in any case, this is a stale issue. > GCS runtime exn: Request payload size exceeds the limit > --- > > Key: FLINK-12376 > URL: https://issues.apache.org/jira/browse/FLINK-12376 > Project: Flink > Issue Type: Bug > Components: Connectors / Google Cloud PubSub >Affects Versions: 1.7.2 > Environment: FROM flink:1.8.0-scala_2.11 > ARG version=0.17 > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib > COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar >Reporter: Henrik >Priority: Not a Priority > Labels: auto-deprioritized-major, auto-deprioritized-minor, > auto-unassigned > Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot > 2019-05-08 at 12.41.07.png > > > I'm trying to use the google cloud storage file system, but it would seem > that the FLINK / GCS client libs are creating too-large requests far down in > the GCS Java client. > The Java client is added to the lib folder with this command in Dockerfile > (probably > [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar] > at the time of writing): > > {code:java} > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib{code} > This is the crash output. Focus lines: > {code:java} > java.lang.RuntimeException: Error while confirming checkpoint{code} > and > {code:java} > Caused by: com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes.{code} > Full stacktrace: > > {code:java} > [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO > org.apache.flink.runtime.taskmanager.Task - Source: > Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) > (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED. > [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error > while confirming checkpoint > [analytics-867c867ff6-l622h taskmanager] at > org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [analytics-867c867ff6-l622h taskmanager] at > java.lang.Thread.run(Thread.java:748) > [analytics-867c867ff6-l622h taskmanager] Caused by: > com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes. 
> [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748) > [analytics-867c867ff6-l622h taskmanager] at > io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507) > [analytics-867c867ff6-l622h taskmanager] at >
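The 524288-byte figure in the trace appears to be the per-request payload cap enforced by the PubSub gRPC API, and the "Error while confirming checkpoint" frame suggests it is hit when a checkpoint confirmation tries to acknowledge every message id received since the last checkpoint in one call. Below is a hedged sketch of one way a source could stay under such a cap by splitting the ids into bounded batches; the class, constant, and callback are illustrative and not the connector's actual code.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative helper: split acknowledge ids into batches bounded by a byte budget.
final class AckBatcher {

    // Stay well below the 512 KiB request limit to leave room for protobuf overhead (assumed budget).
    private static final int MAX_BATCH_BYTES = 400 * 1024;

    static void acknowledgeInBatches(List<String> ackIds, Consumer<List<String>> sendBatch) {
        List<String> batch = new ArrayList<>();
        int batchBytes = 0;
        for (String ackId : ackIds) {
            int size = ackId.getBytes(StandardCharsets.UTF_8).length;
            if (!batch.isEmpty() && batchBytes + size > MAX_BATCH_BYTES) {
                sendBatch.accept(batch);   // e.g. one Acknowledge RPC per batch
                batch = new ArrayList<>();
                batchBytes = 0;
            }
            batch.add(ackId);
            batchBytes += size;
        }
        if (!batch.isEmpty()) {
            sendBatch.accept(batch);
        }
    }
}
{code}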
[jira] [Commented] (FLINK-12382) HA + ResourceManager exception: Fencing token not set
[ https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478562#comment-17478562 ] Henrik commented on FLINK-12382: Do you support running Flink without ZooKeeper now, e.g. with k8s leader election? > HA + ResourceManager exception: Fencing token not set > - > > Key: FLINK-12382 > URL: https://issues.apache.org/jira/browse/FLINK-12382 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.8.0 > Environment: Same all all previous bugs filed by myself, today, but > this time with HA with zetcd. >Reporter: Henrik >Priority: Major > Attachments: jobmanager_log, jobmanager_stdout > > > I'm testing zetcd + session jobs in k8s, and testing what happens when I kill > both the job-cluster and task-manager at the same time, but maintain ZK/zetcd > up and running. > Then I get this stacktrace, that's completely non-actionable for me, and also > resolves itself. I expect a number of retries, and if this exception is part > of the protocol signalling to retry, then it should not be printed as a log > entry. > This might be related to an older bug: > [https://jira.apache.org/jira/browse/FLINK-7734] > {code:java} > [tm] 2019-05-01 11:32:01,641 ERROR > org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration > at ResourceManager failed due to an error > [tm] java.util.concurrent.CompletionException: > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token > not set: Ignoring message > RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, > RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, > HardwareDescription, Time))) sent to > akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing > token is null. 
> [tm] at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > [tm] at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > [tm] at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) > [tm] at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) > [tm] at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > [tm] at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > [tm] at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) > [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) > [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) > [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) > [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) > [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > [tm] at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) > [tm] at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > [tm] at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) > [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) > [tm] at > akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) > [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) > [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) > [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) > [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > [tm] at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [tm] at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [tm] Caused by: > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token > not set: Ignoring message > RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, > RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, > HardwareDescription, Time))) sent to > akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing > token is null. > [tm] at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) > [tm] at >
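On the question above: Flink 1.12 and later ship a ZooKeeper-free HA services implementation backed by Kubernetes ConfigMaps and leader election, so HA on k8s no longer requires ZooKeeper or a zetcd/etcd bridge. A minimal sketch of the relevant flink-conf.yaml entries, following the config format used elsewhere in these issues; the cluster id and storage path are illustrative assumptions:

{code:java}
kubernetes.cluster-id: analytics
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: gs://example_analytics/flink/ha/
{code}

The JobManager's service account additionally needs RBAC permissions to create and edit ConfigMaps in its namespace.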
[jira] [Commented] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit
[ https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359126#comment-17359126 ] Henrik commented on FLINK-12376: It's messed up that you let this linger for two years before acting on it and when you act on it, you let a bot do it. > GCS runtime exn: Request payload size exceeds the limit > --- > > Key: FLINK-12376 > URL: https://issues.apache.org/jira/browse/FLINK-12376 > Project: Flink > Issue Type: Bug > Components: Connectors / Google Cloud PubSub >Affects Versions: 1.7.2 > Environment: FROM flink:1.8.0-scala_2.11 > ARG version=0.17 > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib > COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar >Reporter: Henrik >Priority: Major > Labels: auto-unassigned, stale-major > Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot > 2019-05-08 at 12.41.07.png > > > I'm trying to use the google cloud storage file system, but it would seem > that the FLINK / GCS client libs are creating too-large requests far down in > the GCS Java client. > The Java client is added to the lib folder with this command in Dockerfile > (probably > [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar] > at the time of writing): > > {code:java} > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib{code} > This is the crash output. Focus lines: > {code:java} > java.lang.RuntimeException: Error while confirming checkpoint{code} > and > {code:java} > Caused by: com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes.{code} > Full stacktrace: > > {code:java} > [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO > org.apache.flink.runtime.taskmanager.Task - Source: > Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) > (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED. > [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error > while confirming checkpoint > [analytics-867c867ff6-l622h taskmanager] at > org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [analytics-867c867ff6-l622h taskmanager] at > java.lang.Thread.run(Thread.java:748) > [analytics-867c867ff6-l622h taskmanager] Caused by: > com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes. 
> [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748) > [analytics-867c867ff6-l622h taskmanager] at > io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507) > [analytics-867c867ff6-l622h taskmanager] at >
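One practical note on the Environment above: the image pulls gcs-connector-latest-hadoop2.jar, so the connector version changes underneath the build. Pinning an explicit release keeps failures like this reproducible. A hedged sketch of the Dockerfile with a pinned version; the exact download URL is an assumption and should be checked against the gcs-connector release you need:

{code:java}
FROM flink:1.8.0-scala_2.11
ARG version=0.17
# Pin an explicit gcs-connector release instead of "latest" (URL is illustrative; verify it exists).
ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-1.9.16/gcs-connector-hadoop2-1.9.16-shaded.jar /opt/flink/lib/
COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar
{code}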
[jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842012#comment-16842012 ] Henrik commented on FLINK-12384: It's misbehaving in all different sorts of ways (filter by my reported bugs); but not in any way that I can deduce this as the cause of. Then this ticket is about removing the warning, as it's not a warning (but info). > Rolling the etcd servers causes "Connected to an old server; r-o mode will be > unavailable" > -- > > Key: FLINK-12384 > URL: https://issues.apache.org/jira/browse/FLINK-12384 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Reporter: Henrik >Priority: Major > > {code:java} > [tm] 2019-05-01 13:30:53,316 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - > Initiating client connection, connectString=analytics-zetcd:2181 > sessionTimeout=6 > watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f > [tm] 2019-05-01 13:30:53,384 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL > configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in specified JAAS > configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. > [tm] 2019-05-01 13:30:53,395 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening > socket connection to server > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 > [tm] 2019-05-01 13:30:53,395 INFO > org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using > configured hostname/address for TaskManager: 10.1.2.173. > [tm] 2019-05-01 13:30:53,401 ERROR > org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - > Authentication failed > [tm] 2019-05-01 13:30:53,418 INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to > start actor system at 10.1.2.173:0 > [tm] 2019-05-01 13:30:53,420 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket > connection established to > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating > session > [tm] 2019-05-01 13:30:53,500 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - > Connected to an old server; r-o mode will be unavailable > [tm] 2019-05-01 13:30:53,500 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session > establishment complete on server > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = > 0xbf06a739001d446, negotiated timeout = 6 > [tm] 2019-05-01 13:30:53,525 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager > - State change: CONNECTED{code} > Repro: > Start an etcd-cluster, with e.g. etcd-operator, with three members. Start > zetcd in front. Configure the sesssion cluster to go against zetcd. > Ensure the job can start successfully. > Now, kill the etcd pods one by one, letting the quorum re-establish in > between, so that the cluster is still OK. > Now restart the job/tm pods. You'll end up in this no-mans-land. > > — > Workaround: clean out the etcd cluster and remove all its data, however, this > resets all time windows and state, despite having that saved in GCS, so it's > a crappy workaround. 
> > – > > flink-conf.yaml > {code:java} > parallelism.default: 1 > rest.address: analytics-job > jobmanager.rpc.address: analytics-job # = resource manager's address too > jobmanager.heap.size: 1024m > jobmanager.rpc.port: 6123 > jobmanager.slot.request.timeout: 3 > resourcemanager.rpc.port: 6123 > high-availability.jobmanager.port: 6123 > blob.server.port: 6124 > queryable-state.server.ports: 6125 > taskmanager.heap.size: 1024m > taskmanager.numberOfTaskSlots: 1 > web.log.path: /var/lib/log/flink/jobmanager.log > rest.port: 8081 > rest.bind-address: 0.0.0.0 > web.submit.enable: false > high-availability: zookeeper > high-availability.storageDir: gs://example_analytics/flink/zetcd/ > high-availability.zookeeper.quorum: analytics-zetcd:2181 > high-availability.zookeeper.path.root: /flink > high-availability.zookeeper.client.acl: open > state.backend: rocksdb > state.checkpoints.num-retained: 3 > state.checkpoints.dir: gs://example_analytics/flink/checkpoints > state.savepoints.dir: gs://example_analytics/flink/savepoints > state.backend.incremental: true > state.backend.async: true > fs.hdfs.hadoopconf: /opt/flink/hadoop > log.file: /var/lib/log/flink/jobmanager.log{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
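For readers unfamiliar with the setup in the repro: zetcd is a proxy that speaks the ZooKeeper wire protocol while storing its data in etcd, which is why Flink's ZooKeeper HA services can be pointed at an etcd cluster at all. A hedged sketch of how the pieces line up; the zetcd flags are taken from its README (verify against your version) and the etcd client service name is an assumption:

{code:java}
# zetcd exposes a ZooKeeper-compatible endpoint backed by the etcd cluster
zetcd --zkaddr 0.0.0.0:2181 --endpoints analytics-etcd-client:2379

# flink-conf.yaml then treats the zetcd service as its ZooKeeper quorum,
# matching the config shown above:
#   high-availability: zookeeper
#   high-availability.zookeeper.quorum: analytics-zetcd:2181
{code}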
[jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841391#comment-16841391 ] Henrik commented on FLINK-12384: zetcd refers to the project by the name [https://github.com/helm/charts/tree/master/stable/zetcd/] I'm not running ZK at all, but etcd. > Rolling the etcd servers causes "Connected to an old server; r-o mode will be > unavailable" > -- > > Key: FLINK-12384 > URL: https://issues.apache.org/jira/browse/FLINK-12384 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Reporter: Henrik >Priority: Major > > {code:java} > [tm] 2019-05-01 13:30:53,316 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - > Initiating client connection, connectString=analytics-zetcd:2181 > sessionTimeout=6 > watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f > [tm] 2019-05-01 13:30:53,384 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL > configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in specified JAAS > configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. > [tm] 2019-05-01 13:30:53,395 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening > socket connection to server > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 > [tm] 2019-05-01 13:30:53,395 INFO > org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using > configured hostname/address for TaskManager: 10.1.2.173. > [tm] 2019-05-01 13:30:53,401 ERROR > org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - > Authentication failed > [tm] 2019-05-01 13:30:53,418 INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to > start actor system at 10.1.2.173:0 > [tm] 2019-05-01 13:30:53,420 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket > connection established to > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating > session > [tm] 2019-05-01 13:30:53,500 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - > Connected to an old server; r-o mode will be unavailable > [tm] 2019-05-01 13:30:53,500 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session > establishment complete on server > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = > 0xbf06a739001d446, negotiated timeout = 6 > [tm] 2019-05-01 13:30:53,525 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager > - State change: CONNECTED{code} > Repro: > Start an etcd-cluster, with e.g. etcd-operator, with three members. Start > zetcd in front. Configure the sesssion cluster to go against zetcd. > Ensure the job can start successfully. > Now, kill the etcd pods one by one, letting the quorum re-establish in > between, so that the cluster is still OK. > Now restart the job/tm pods. You'll end up in this no-mans-land. > > — > Workaround: clean out the etcd cluster and remove all its data, however, this > resets all time windows and state, despite having that saved in GCS, so it's > a crappy workaround. 
> > – > > flink-conf.yaml > {code:java} > parallelism.default: 1 > rest.address: analytics-job > jobmanager.rpc.address: analytics-job # = resource manager's address too > jobmanager.heap.size: 1024m > jobmanager.rpc.port: 6123 > jobmanager.slot.request.timeout: 3 > resourcemanager.rpc.port: 6123 > high-availability.jobmanager.port: 6123 > blob.server.port: 6124 > queryable-state.server.ports: 6125 > taskmanager.heap.size: 1024m > taskmanager.numberOfTaskSlots: 1 > web.log.path: /var/lib/log/flink/jobmanager.log > rest.port: 8081 > rest.bind-address: 0.0.0.0 > web.submit.enable: false > high-availability: zookeeper > high-availability.storageDir: gs://example_analytics/flink/zetcd/ > high-availability.zookeeper.quorum: analytics-zetcd:2181 > high-availability.zookeeper.path.root: /flink > high-availability.zookeeper.client.acl: open > state.backend: rocksdb > state.checkpoints.num-retained: 3 > state.checkpoints.dir: gs://example_analytics/flink/checkpoints > state.savepoints.dir: gs://example_analytics/flink/savepoints > state.backend.incremental: true > state.backend.async: true > fs.hdfs.hadoopconf: /opt/flink/hadoop > log.file: /var/lib/log/flink/jobmanager.log{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12384: --- Description: {code:java} [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using configured hostname/address for TaskManager: 10.1.2.173. [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed [tm] 2019-05-01 13:30:53,418 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 10.1.2.173:0 [tm] 2019-05-01 13:30:53,420 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session [tm] 2019-05-01 13:30:53,500 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable [tm] 2019-05-01 13:30:53,500 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 0xbf06a739001d446, negotiated timeout = 6 [tm] 2019-05-01 13:30:53,525 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED{code} Repro: Start an etcd-cluster, with e.g. etcd-operator, with three members. Start zetcd in front. Configure the sesssion cluster to go against zetcd. Ensure the job can start successfully. Now, kill the etcd pods one by one, letting the quorum re-establish in between, so that the cluster is still OK. Now restart the job/tm pods. You'll end up in this no-mans-land. — Workaround: clean out the etcd cluster and remove all its data, however, this resets all time windows and state, despite having that saved in GCS, so it's a crappy workaround. 
– flink-conf.yaml {code:java} parallelism.default: 1 rest.address: analytics-job jobmanager.rpc.address: analytics-job # = resource manager's address too jobmanager.heap.size: 1024m jobmanager.rpc.port: 6123 jobmanager.slot.request.timeout: 3 resourcemanager.rpc.port: 6123 high-availability.jobmanager.port: 6123 blob.server.port: 6124 queryable-state.server.ports: 6125 taskmanager.heap.size: 1024m taskmanager.numberOfTaskSlots: 1 web.log.path: /var/lib/log/flink/jobmanager.log rest.port: 8081 rest.bind-address: 0.0.0.0 web.submit.enable: false high-availability: zookeeper high-availability.storageDir: gs://example_analytics/flink/zetcd/ high-availability.zookeeper.quorum: analytics-zetcd:2181 high-availability.zookeeper.path.root: /flink high-availability.zookeeper.client.acl: open state.backend: rocksdb state.checkpoints.num-retained: 3 state.checkpoints.dir: gs://example_analytics/flink/checkpoints state.savepoints.dir: gs://example_analytics/flink/savepoints state.backend.incremental: true state.backend.async: true fs.hdfs.hadoopconf: /opt/flink/hadoop log.file: /var/lib/log/flink/jobmanager.log{code} was: {code:java} [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 [tm] 2019-05-01 13:30:53,395
[jira] [Commented] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841133#comment-16841133 ] Henrik commented on FLINK-12384: Thanks for hte ping [~gjy] I've updated the issue with that information. > Rolling the etcd servers causes "Connected to an old server; r-o mode will be > unavailable" > -- > > Key: FLINK-12384 > URL: https://issues.apache.org/jira/browse/FLINK-12384 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Reporter: Henrik >Priority: Major > > {code:java} > [tm] 2019-05-01 13:30:53,316 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - > Initiating client connection, connectString=analytics-zetcd:2181 > sessionTimeout=6 > watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f > [tm] 2019-05-01 13:30:53,384 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL > configuration failed: javax.security.auth.login.LoginException: No JAAS > configuration section named 'Client' was found in specified JAAS > configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue > connection to Zookeeper server without SASL authentication, if Zookeeper > server allows it. > [tm] 2019-05-01 13:30:53,395 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening > socket connection to server > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 > [tm] 2019-05-01 13:30:53,395 INFO > org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using > configured hostname/address for TaskManager: 10.1.2.173. > [tm] 2019-05-01 13:30:53,401 ERROR > org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - > Authentication failed > [tm] 2019-05-01 13:30:53,418 INFO > org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to > start actor system at 10.1.2.173:0 > [tm] 2019-05-01 13:30:53,420 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket > connection established to > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating > session > [tm] 2019-05-01 13:30:53,500 WARN > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - > Connected to an old server; r-o mode will be unavailable > [tm] 2019-05-01 13:30:53,500 INFO > org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session > establishment complete on server > analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = > 0xbf06a739001d446, negotiated timeout = 6 > [tm] 2019-05-01 13:30:53,525 INFO > org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager > - State change: CONNECTED{code} > Repro: > Start an etcd-cluster, with e.g. etcd-operator, with three members. Start > zetcd in front. Configure the sesssion cluster to go against zetcd. > Ensure the job can start successfully. > Now, kill the etcd pods one by one, letting the quorum re-establish in > between, so that the cluster is still OK. > Now restart the job/tm pods. You'll end up in this no-mans-land. > > — > Workaround: clean out the etcd cluster and remove all its data, however, this > resets all time windows and state, despite having that saved in GCS, so it's > a crappy workaround. 
> > -- > > flink-conf.yaml > {{ parallelism.default: 1}} > {{ rest.address: analytics-job}} > {{ jobmanager.rpc.address: analytics-job # = resource manager's address too}} > {{ jobmanager.heap.size: 1024m}} > {{ jobmanager.rpc.port: 6123}} > {{ jobmanager.slot.request.timeout: 3}} > {{ resourcemanager.rpc.port: 6123}} > {{ high-availability.jobmanager.port: 6123}} > {{ blob.server.port: 6124}} > {{ queryable-state.server.ports: 6125}} > {{ taskmanager.heap.size: 1024m}} > {{ taskmanager.numberOfTaskSlots: 1}} > {{ web.log.path: /var/lib/log/flink/jobmanager.log}} > {{ rest.port: 8081}} > {{ rest.bind-address: 0.0.0.0}} > {{ web.submit.enable: false}} > {{ high-availability: zookeeper}} > {{ high-availability.storageDir: > gs://project-id-example_analytics/flink/zetcd/}} > {{ high-availability.zookeeper.quorum: analytics-zetcd:2181}} > {{ high-availability.zookeeper.path.root: /flink}} > {{ high-availability.zookeeper.client.acl: open}} > {{ state.backend: rocksdb}} > {{ state.checkpoints.num-retained: 3}} > {{ state.checkpoints.dir: > gs://project-id-example_analytics/flink/checkpoints}} > {{ state.savepoints.dir: > gs://}}{{project-id-example}}{{_analytics/flink/savepoints}} > {{ state.backend.incremental: true}} > {{ state.backend.async: true}} > {{ fs.hdfs.hadoopconf: /opt/flink/hadoop}} > {{ log.file: /var/lib/log/flink/jobmanager.log}} -- This message was sent by
[jira] [Updated] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12384: --- Description: {code:java} [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using configured hostname/address for TaskManager: 10.1.2.173. [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed [tm] 2019-05-01 13:30:53,418 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 10.1.2.173:0 [tm] 2019-05-01 13:30:53,420 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session [tm] 2019-05-01 13:30:53,500 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable [tm] 2019-05-01 13:30:53,500 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 0xbf06a739001d446, negotiated timeout = 6 [tm] 2019-05-01 13:30:53,525 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED{code} Repro: Start an etcd-cluster, with e.g. etcd-operator, with three members. Start zetcd in front. Configure the sesssion cluster to go against zetcd. Ensure the job can start successfully. Now, kill the etcd pods one by one, letting the quorum re-establish in between, so that the cluster is still OK. Now restart the job/tm pods. You'll end up in this no-mans-land. — Workaround: clean out the etcd cluster and remove all its data, however, this resets all time windows and state, despite having that saved in GCS, so it's a crappy workaround. 
-- flink-conf.yaml {{ parallelism.default: 1}} {{ rest.address: analytics-job}} {{ jobmanager.rpc.address: analytics-job # = resource manager's address too}} {{ jobmanager.heap.size: 1024m}} {{ jobmanager.rpc.port: 6123}} {{ jobmanager.slot.request.timeout: 3}} {{ resourcemanager.rpc.port: 6123}} {{ high-availability.jobmanager.port: 6123}} {{ blob.server.port: 6124}} {{ queryable-state.server.ports: 6125}} {{ taskmanager.heap.size: 1024m}} {{ taskmanager.numberOfTaskSlots: 1}} {{ web.log.path: /var/lib/log/flink/jobmanager.log}} {{ rest.port: 8081}} {{ rest.bind-address: 0.0.0.0}} {{ web.submit.enable: false}} {{ high-availability: zookeeper}} {{ high-availability.storageDir: gs://project-id-example_analytics/flink/zetcd/}} {{ high-availability.zookeeper.quorum: analytics-zetcd:2181}} {{ high-availability.zookeeper.path.root: /flink}} {{ high-availability.zookeeper.client.acl: open}} {{ state.backend: rocksdb}} {{ state.checkpoints.num-retained: 3}} {{ state.checkpoints.dir: gs://project-id-example_analytics/flink/checkpoints}} {{ state.savepoints.dir: gs://}}{{project-id-example}}{{_analytics/flink/savepoints}} {{ state.backend.incremental: true}} {{ state.backend.async: true}} {{ fs.hdfs.hadoopconf: /opt/flink/hadoop}} {{ log.file: /var/lib/log/flink/jobmanager.log}} was: {code:java} [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. [tm] 2019-05-01 13:30:53,395 INFO
[jira] [Comment Edited] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes
[ https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835800#comment-16835800 ] Henrik edited comment on FLINK-12381 at 5/8/19 6:34 PM: Yes, you can see it like that (a new cluster), I suppose. So does that mean that flink is useless without HA then? Because if I don't have HA, and the node I'm running it on, or the k8s pod I'm running it in, restarts, it's a new cluster? In the optimal world, I would not have to manually change the specification of the job that runs, without the job that runs also having been changed. I.e. it goes against declarative running of resources in a k8s cluster to manually have to change the jobid whenever the pod is restarted. was (Author: haf): Yes, you can see it like that (a new cluster), I suppose. So does that mean that flink is useless without HA then? Because if I don't have HA, and the node I'm running it on, or the k8s pod I'm running it in, restarts, it's a new cluster? > W/o HA, upon a full restart, checkpointing crashes > -- > > Key: FLINK-12381 > URL: https://issues.apache.org/jira/browse/FLINK-12381 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.8.0 > Environment: Same as FLINK-\{12379, 12377, 12376} >Reporter: Henrik >Priority: Major > > {code:java} > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) > ... 8 more > {code} > Instead, it should either just overwrite the checkpoint or fail to start the > job completely. Partial and undefined failure is not what should happen. > > Repro: > # Set up a single purpose job cluster (which could use much better docs btw!) > # Let it run with GCS checkpointing for a while with rocksdb/gs://example > # Kill it > # Start it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
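To make the failure mode concrete: checkpoints are laid out as <checkpointDir>/<jobId>/chk-<n>/_metadata, and a standalone job cluster without HA comes back with the same job id (by default the all-zeros id, unless an explicit id is passed; check your version) but a checkpoint counter reset to 1, so a later checkpoint recomputes a path the previous run already wrote. A minimal, self-contained illustration of that path collision; the class and values are illustrative, not Flink code:

{code:java}
// Illustrative demo of the checkpoint path collision after a non-HA restart.
final class CheckpointPathDemo {

    // Flink lays out checkpoints as <checkpointDir>/<jobId>/chk-<n>/_metadata
    static String metadataPath(String checkpointDir, String jobId, long checkpointId) {
        return checkpointDir + "/" + jobId + "/chk-" + checkpointId + "/_metadata";
    }

    public static void main(String[] args) {
        String dir = "gs://example_bucket/flink/checkpoints";
        // Assumed default job id of a standalone job cluster when none is supplied.
        String fixedJobId = "00000000000000000000000000000000";

        // Run one wrote chk-1..chk-16 before being killed; after a restart without HA
        // the counter starts over, so run two eventually recomputes an existing path:
        System.out.println(metadataPath(dir, fixedJobId, 16));
    }
}
{code}

With HA enabled, the checkpoint counter and the completed checkpoints are recovered from the HA store, which is why the collision does not occur in that setup.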
[jira] [Comment Edited] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
[ https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835505#comment-16835505 ] Henrik edited comment on FLINK-12379 at 5/8/19 6:29 PM: Yes, it's the "standalone-job.sh" entrypoint. [~StephanEwen] AFAIK it's the recommended way to deploy a standalone job? was (Author: haf): Yes, it's the "standalone-job.sh" entrypoint. [~StephanEwen] > Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint > > > Key: FLINK-12379 > URL: https://issues.apache.org/jira/browse/FLINK-12379 > Project: Flink > Issue Type: Bug > Components: FileSystems, Runtime / Coordination >Affects Versions: 1.8.0 > Environment: GCS + > > {code:java} > 1.8.0 > 1.8 > 2.11{code} > {code:java} > > > > > com.google.cloud.bigdataoss > gcs-connector > hadoop2-1.9.16 > > > org.apache.flink > flink-connector-filesystem_2.11 > ${flink.version} > > > org.apache.flink > flink-hadoop-fs > ${flink.version} > > > > org.apache.flink > flink-shaded-hadoop2 > ${hadoop.version}-${flink.version} > > {code} > > >Reporter: Henrik >Priority: Major > > When running one standalone-job w/ parallelism=1 + one taskmanager, you will > shortly get this crash > {code:java} > 2019-04-30 22:20:02,928 WARN org.apache.flink.runtime.jobmaster.JobMaster > - Error while processing checkpoint acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 5. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756) > at > org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-5/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) > at > 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) > ... 8 more > Caused by: java.nio.file.FileAlreadyExistsException: Object > gs://example_bucket/flink/checkpoints//chk-5/_metadata > already exists. > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82) > ... 19 more > 2019-04-30 22:20:03,114 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 6 @
[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes
[ https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835800#comment-16835800 ] Henrik commented on FLINK-12381: Yes, you can see it like that (a new cluster), I suppose. So does that mean that flink is useless without HA then? Because if I don't have HA, and the node I'm running it on, or the k8s pod I'm running it in, restarts, it's a new cluster? > W/o HA, upon a full restart, checkpointing crashes > -- > > Key: FLINK-12381 > URL: https://issues.apache.org/jira/browse/FLINK-12381 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.8.0 > Environment: Same as FLINK-\{12379, 12377, 12376} >Reporter: Henrik >Priority: Major > > {code:java} > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) > ... 8 more > {code} > Instead, it should either just overwrite the checkpoint or fail to start the > job completely. Partial and undefined failure is not what should happen. > > Repro: > # Set up a single purpose job cluster (which could use much better docs btw!) > # Let it run with GCS checkpointing for a while with rocksdb/gs://example > # Kill it > # Start it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit
[ https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835503#comment-16835503 ] Henrik commented on FLINK-12376: > Either [~haf] used a version of the connector that did not have this fix > (could you confirm [~haf]?) I'm using the latest version; the one you pinged me and said that you had a fix for some exceptions in (the one that you said would work with 1.7.x) This is the code that was running in this issue. !Screenshot 2019-05-08 at 12.41.07.png! > GCS runtime exn: Request payload size exceeds the limit > --- > > Key: FLINK-12376 > URL: https://issues.apache.org/jira/browse/FLINK-12376 > Project: Flink > Issue Type: Bug > Components: Connectors / Google Cloud PubSub >Affects Versions: 1.7.2 > Environment: FROM flink:1.8.0-scala_2.11 > ARG version=0.17 > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib > COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar >Reporter: Henrik >Assignee: Richard Deurwaarder >Priority: Major > Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot > 2019-05-08 at 12.41.07.png > > > I'm trying to use the google cloud storage file system, but it would seem > that the FLINK / GCS client libs are creating too-large requests far down in > the GCS Java client. > The Java client is added to the lib folder with this command in Dockerfile > (probably > [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar] > at the time of writing): > > {code:java} > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib{code} > This is the crash output. Focus lines: > {code:java} > java.lang.RuntimeException: Error while confirming checkpoint{code} > and > {code:java} > Caused by: com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes.{code} > Full stacktrace: > > {code:java} > [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO > org.apache.flink.runtime.taskmanager.Task - Source: > Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) > (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED. > [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error > while confirming checkpoint > [analytics-867c867ff6-l622h taskmanager] at > org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [analytics-867c867ff6-l622h taskmanager] at > java.lang.Thread.run(Thread.java:748) > [analytics-867c867ff6-l622h taskmanager] Caused by: > com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes. 
> [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) > [analytics-867c867ff6-l622h taskmanager] at >
[jira] [Comment Edited] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
[ https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835505#comment-16835505 ] Henrik edited comment on FLINK-12379 at 5/8/19 10:43 AM: - Yes, it's the "standalone-job.sh" entrypoint. [~StephanEwen] was (Author: haf): Yes, it's the "standalone-job.sh" entrypoint. > Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint > > > Key: FLINK-12379 > URL: https://issues.apache.org/jira/browse/FLINK-12379 > Project: Flink > Issue Type: Bug > Components: FileSystems, Runtime / Coordination >Affects Versions: 1.8.0 > Environment: GCS + > > {code:java} > 1.8.0 > 1.8 > 2.11{code} > {code:java} > > > > > com.google.cloud.bigdataoss > gcs-connector > hadoop2-1.9.16 > > > org.apache.flink > flink-connector-filesystem_2.11 > ${flink.version} > > > org.apache.flink > flink-hadoop-fs > ${flink.version} > > > > org.apache.flink > flink-shaded-hadoop2 > ${hadoop.version}-${flink.version} > > {code} > > >Reporter: Henrik >Priority: Major > > When running one standalone-job w/ parallelism=1 + one taskmanager, you will > shortly get this crash > {code:java} > 2019-04-30 22:20:02,928 WARN org.apache.flink.runtime.jobmaster.JobMaster > - Error while processing checkpoint acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 5. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756) > at > org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-5/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) > at > 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) > ... 8 more > Caused by: java.nio.file.FileAlreadyExistsException: Object > gs://example_bucket/flink/checkpoints//chk-5/_metadata > already exists. > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82) > ... 19 more > 2019-04-30 22:20:03,114 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 6 @ 1556662802928 for job .{code} > My guess at why;
[jira] [Commented] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
[ https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835505#comment-16835505 ] Henrik commented on FLINK-12379: Yes, it's the "standalone-job.sh" entrypoint. > Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint > > > Key: FLINK-12379 > URL: https://issues.apache.org/jira/browse/FLINK-12379 > Project: Flink > Issue Type: Bug > Components: FileSystems, Runtime / Coordination >Affects Versions: 1.8.0 > Environment: GCS + > > {code:java} > 1.8.0 > 1.8 > 2.11{code} > {code:java} > > > > > com.google.cloud.bigdataoss > gcs-connector > hadoop2-1.9.16 > > > org.apache.flink > flink-connector-filesystem_2.11 > ${flink.version} > > > org.apache.flink > flink-hadoop-fs > ${flink.version} > > > > org.apache.flink > flink-shaded-hadoop2 > ${hadoop.version}-${flink.version} > > {code} > > >Reporter: Henrik >Priority: Major > > When running one standalone-job w/ parallelism=1 + one taskmanager, you will > shortly get this crash > {code:java} > 2019-04-30 22:20:02,928 WARN org.apache.flink.runtime.jobmaster.JobMaster > - Error while processing checkpoint acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 5. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756) > at > org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-5/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) > ... 
8 more > Caused by: java.nio.file.FileAlreadyExistsException: Object > gs://example_bucket/flink/checkpoints//chk-5/_metadata > already exists. > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286) > at > com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82) > ... 19 more > 2019-04-30 22:20:03,114 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 6 @ 1556662802928 for job .{code} > My guess at why; concurrent checkpoint writers are updating the _metadata > resource concurrently. They should be using optimistic concurrency
[jira] [Updated] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit
[ https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12376: --- Attachment: Screenshot 2019-05-08 at 12.41.07.png > GCS runtime exn: Request payload size exceeds the limit > --- > > Key: FLINK-12376 > URL: https://issues.apache.org/jira/browse/FLINK-12376 > Project: Flink > Issue Type: Bug > Components: Connectors / Google Cloud PubSub >Affects Versions: 1.7.2 > Environment: FROM flink:1.8.0-scala_2.11 > ARG version=0.17 > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib > COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar >Reporter: Henrik >Assignee: Richard Deurwaarder >Priority: Major > Attachments: Screenshot 2019-04-30 at 22.32.34.png, Screenshot > 2019-05-08 at 12.41.07.png > > > I'm trying to use the google cloud storage file system, but it would seem > that the FLINK / GCS client libs are creating too-large requests far down in > the GCS Java client. > The Java client is added to the lib folder with this command in Dockerfile > (probably > [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar] > at the time of writing): > > {code:java} > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib{code} > This is the crash output. Focus lines: > {code:java} > java.lang.RuntimeException: Error while confirming checkpoint{code} > and > {code:java} > Caused by: com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes.{code} > Full stacktrace: > > {code:java} > [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO > org.apache.flink.runtime.taskmanager.Task - Source: > Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) > (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED. > [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error > while confirming checkpoint > [analytics-867c867ff6-l622h taskmanager] at > org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [analytics-867c867ff6-l622h taskmanager] at > java.lang.Thread.run(Thread.java:748) > [analytics-867c867ff6-l622h taskmanager] Caused by: > com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes. 
> [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748) > [analytics-867c867ff6-l622h taskmanager] at > io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507) > [analytics-867c867ff6-l622h taskmanager] at > io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482) > [analytics-867c867ff6-l622h taskmanager] at >
[jira] [Commented] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes
[ https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835507#comment-16835507 ] Henrik commented on FLINK-12381: It's just a single "cluster", so I don't think what you write above applies. There's only a single app/job. So I don't see how setting a job id helps. > W/o HA, upon a full restart, checkpointing crashes > -- > > Key: FLINK-12381 > URL: https://issues.apache.org/jira/browse/FLINK-12381 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Coordination >Affects Versions: 1.8.0 > Environment: Same as FLINK-\{12379, 12377, 12376} >Reporter: Henrik >Priority: Major > > {code:java} > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) > ... 8 more > {code} > Instead, it should either just overwrite the checkpoint or fail to start the > job completely. Partial and undefined failure is not what should happen. > > Repro: > # Set up a single purpose job cluster (which could use much better docs btw!) > # Let it run with GCS checkpointing for a while with rocksdb/gs://example > # Kill it > # Start it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
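Until the finalization behaviour itself changes, one hedged workaround (independent of any job id) is to give each fresh, non-HA deployment its own checkpoint namespace so it cannot collide with chk-N/_metadata objects left by a previous run. The keys below are standard Flink options; the per-run path suffix is a placeholder, not taken from this report:

{code:java}
# flink-conf.yaml (sketch)
state.backend: rocksdb
# A per-deployment suffix keeps a restarted job from reusing an old chk-N directory.
state.checkpoints.dir: gs://example_bucket/flink/checkpoints/run-2019-05-01
{code}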
[jira] [Updated] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
[ https://issues.apache.org/jira/browse/FLINK-12384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12384: --- Description: {code:java} [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using configured hostname/address for TaskManager: 10.1.2.173. [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed [tm] 2019-05-01 13:30:53,418 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 10.1.2.173:0 [tm] 2019-05-01 13:30:53,420 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session [tm] 2019-05-01 13:30:53,500 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable [tm] 2019-05-01 13:30:53,500 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 0xbf06a739001d446, negotiated timeout = 6 [tm] 2019-05-01 13:30:53,525 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED{code} Repro: Start an etcd-cluster, with e.g. etcd-operator, with three members. Start zetcd in front. Configure the sesssion cluster to go against zetcd. Ensure the job can start successfully. Now, kill the etcd pods one by one, letting the quorum re-establish in between, so that the cluster is still OK. Now restart the job/tm pods. You'll end up in this no-mans-land. --- Workaround: clean out the etcd cluster and remove all its data, however, this resets all time windows and state, despite having that saved in GCS, so it's a crappy workaround. was: {code:java} [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 
[tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using configured hostname/address for TaskManager: 10.1.2.173. [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed [tm] 2019-05-01 13:30:53,418 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 10.1.2.173:0 [tm] 2019-05-01 13:30:53,420 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session [tm] 2019-05-01 13:30:53,500 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable [tm] 2019-05-01 13:30:53,500 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 0xbf06a739001d446, negotiated timeout = 6 [tm] 2019-05-01 13:30:53,525 INFO
[jira] [Created] (FLINK-12384) Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable"
Henrik created FLINK-12384: -- Summary: Rolling the etcd servers causes "Connected to an old server; r-o mode will be unavailable" Key: FLINK-12384 URL: https://issues.apache.org/jira/browse/FLINK-12384 Project: Flink Issue Type: Bug Reporter: Henrik {code:java} [tm] 2019-05-01 13:30:53,316 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics-zetcd:2181 sessionTimeout=6 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@5c8eee0f [tm] 2019-05-01 13:30:53,384 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-3674237213070587877.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181 [tm] 2019-05-01 13:30:53,395 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Using configured hostname/address for TaskManager: 10.1.2.173. [tm] 2019-05-01 13:30:53,401 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed [tm] 2019-05-01 13:30:53,418 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 10.1.2.173:0 [tm] 2019-05-01 13:30:53,420 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, initiating session [tm] 2019-05-01 13:30:53,500 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable [tm] 2019-05-01 13:30:53,500 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics-zetcd.default.svc.cluster.local/10.108.52.97:2181, sessionid = 0xbf06a739001d446, negotiated timeout = 6 [tm] 2019-05-01 13:30:53,525 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED{code} Repro: Start an etcd-cluster, with e.g. etcd-operator, with three members. Start zetcd in front. Configure the sesssion cluster to go against zetcd. Ensure the job can start successfully. Now, kill the etcd pods one by one, letting the quorum re-establish in between, so that the cluster is still OK. Now restart the job/tm pods. You'll end up in this no-mans-land. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
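For context, the repro assumes a session cluster whose ZooKeeper-based HA backend is pointed at zetcd. A sketch of that configuration using standard Flink HA keys; the quorum address matches the log above, while the storage path and cluster id are placeholders:

{code:java}
# flink-conf.yaml (sketch of the assumed setup)
high-availability: zookeeper
high-availability.zookeeper.quorum: analytics-zetcd:2181
high-availability.storageDir: gs://example_bucket/flink/ha
high-availability.cluster-id: /analytics
{code}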
[jira] [Created] (FLINK-12383) "Log file environment variable 'log.file' is not set" despite web.log.path being set
Henrik created FLINK-12383: -- Summary: "Log file environment variable 'log.file' is not set" despite web.log.path being set Key: FLINK-12383 URL: https://issues.apache.org/jira/browse/FLINK-12383 Project: Flink Issue Type: Bug Components: Runtime / Web Frontend Affects Versions: 1.8.0 Reporter: Henrik You get this warning when starting a session cluster, despite having configured all log-related settings as specified by the configuration reference on the [web site|https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#web-frontend]: {code:java} [job] 2019-05-01 13:25:35,418 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set. [job] 2019-05-01 13:25:35,419 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /var/lib/log/flink/jobmanager.log [job] 2019-05-01 13:25:35,419 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /var/lib/log/flink/jobmanager.out {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
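The WARN appears to be emitted whenever the JVM itself lacks a -Dlog.file system property; web.log.path seems to be consulted only as a fallback, which is why the location is still determined in the next INFO line. A hedged way to silence the warning in a container image, assuming the standard env.java.opts option is honoured by the entrypoint (the path mirrors the log output above and is only an example):

{code:java}
# flink-conf.yaml (sketch)
env.java.opts: -Dlog.file=/var/lib/log/flink/jobmanager.log
web.log.path: /var/lib/log/flink/jobmanager.log
{code}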
[jira] [Updated] (FLINK-12382) HA + ResourceManager exception: Fencing token not set
[ https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12382: --- Description: I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster and task-manager at the same time, but maintain ZK/zetcd up and running. Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself. I expect a number of retries, and if this exception is part of the protocol signalling to retry, then it should not be printed as a log entry. This might be related to an older bug: [https://jira.apache.org/jira/browse/FLINK-7734] {code:java} [tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error [tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) [tm] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) [tm] at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) [tm] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) [tm] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) [tm] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) [tm] at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) [tm] at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) [tm] at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) [tm] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) [tm] at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message 
RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] ... 9 more [tm] 2019-05-01 11:32:01,650 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and re-attempting registration in 1 ms [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close
[jira] [Updated] (FLINK-12382) HA + ResourceManager exception: Fencing token not set
[ https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12382: --- Description: I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster and task-manager at the same time, but maintain ZK/zetcd up and running. Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself. I expect a number of retries, and if this exception is part of the protocol signalling to retry, then it should not be printed as a log entry. This might be related to an older bug: [https://jira.apache.org/jira/browse/FLINK-7734] {code:java} [tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error [tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) [tm] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) [tm] at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) [tm] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) [tm] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) [tm] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) [tm] at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) [tm] at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) [tm] at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) [tm] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) [tm] at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message 
RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] ... 9 more [tm] 2019-05-01 11:32:01,650 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and re-attempting registration in 1 ms [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close
[jira] [Created] (FLINK-12382) HA + ResourceManager exception: Fencing token not set
Henrik created FLINK-12382: -- Summary: HA + ResourceManager exception: Fencing token not set Key: FLINK-12382 URL: https://issues.apache.org/jira/browse/FLINK-12382 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.8.0 Environment: Same all all previous bugs filed by myself, today, but this time with HA with zetcd. Reporter: Henrik I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster and task-manager at the same time, but maintain ZK/zetcd up and running. Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself. I expect a number of retries, and if this exception is part of the protocol signalling to retry, then it should not be printed as a log entry. This might be related to an older bug: https://jira.apache.org/jira/browse/FLINK-7734 {code:java} [tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error [tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) [tm] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) [tm] at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) [tm] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) [tm] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) [tm] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) [tm] at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) [tm] at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) [tm] at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) [tm] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) [tm] at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] ... 9 more [tm] 2019-05-01 11:32:01,650 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and re-attempting
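For readers unfamiliar with the mechanism: a fencing token ties each RPC to the current leadership session, and messages arriving before the token has been (re)established are rejected rather than processed. A toy sketch of that idea only, explicitly not Flink's actual RPC implementation:

{code:java}
import java.util.Objects;
import java.util.UUID;

/**
 * Toy fencing-token check: the endpoint accepts a message only when the
 * caller's token matches the current leadership token. Right after a restart
 * the local token is still null, so early registrations are rejected and the
 * caller is expected to retry, which matches the recovery seen in the log.
 */
final class FencedEndpoint {

    private volatile UUID fencingToken; // null until leadership is granted

    void grantLeadership(UUID newToken) {
        this.fencingToken = newToken;
    }

    void handleMessage(UUID senderToken, Runnable action) {
        UUID current = this.fencingToken;
        if (current == null || !Objects.equals(current, senderToken)) {
            // A real system would report this back to the sender so it can retry later.
            throw new IllegalStateException("Fencing token not set or mismatched: ignoring message");
        }
        action.run();
    }
}
{code}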
[jira] [Updated] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes
[ https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12381: --- Description: {code:java} Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' already exists at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) ... 8 more {code} Instead, it should either just overwrite the checkpoint or fail to start the job completely. Partial and undefined failure is not what should happen. Repro: # Set up a single purpose job cluster (which could use much better docs btw!) # Let it run with GCS checkpointing for a while with rocksdb/gs://example # Kill it # Start it was: {code:java} Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' already exists at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) ... 8 more {code} Instead, it should either just overwrite the checkpoint or fail to start the job completely. Partial and undefined failure is not what should happen. 
> W/o HA, upon a full restart, checkpointing crashes > -- > > Key: FLINK-12381 > URL: https://issues.apache.org/jira/browse/FLINK-12381 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.8.0 > Environment: Same as FLINK-\{12379, 12377, 12376} >Reporter: Henrik >Priority: Major > > {code:java} > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at >
[jira] [Updated] (FLINK-12381) W/o HA, upon a full restart, checkpointing crashes
[ https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12381: --- Summary: W/o HA, upon a full restart, checkpointing crashes (was: Without failover (aka "HA") configured, full restarts' checkpointing crashes) > W/o HA, upon a full restart, checkpointing crashes > -- > > Key: FLINK-12381 > URL: https://issues.apache.org/jira/browse/FLINK-12381 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.8.0 > Environment: Same as FLINK-\{12379, 12377, 12376} >Reporter: Henrik >Priority: Major > > {code:java} > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' > already exists > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) > at > com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) > ... 8 more > {code} > Instead, it should either just overwrite the checkpoint or fail to start the > job completely. Partial and undefined failure is not what should happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12381) Without failover (aka "HA") configured, full restarts' checkpointing crashes
Henrik created FLINK-12381: -- Summary: Without failover (aka "HA") configured, full restarts' checkpointing crashes Key: FLINK-12381 URL: https://issues.apache.org/jira/browse/FLINK-12381 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.8.0 Environment: Same as FLINK-\{12379, 12377, 12376} Reporter: Henrik {code:java} Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints//chk-16/_metadata' already exists at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) ... 8 more {code} Instead, it should either just overwrite the checkpoint or fail to start the job completely. Partial and undefined failure is not what should happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
[ https://issues.apache.org/jira/browse/FLINK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12379: --- Description: When running one standalone-job w/ parallelism=1 + one taskmanager, you will shortly get this crash {code:java} 2019-04-30 22:20:02,928 WARN org.apache.flink.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 5. at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756) at org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints//chk-5/_metadata' already exists at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) ... 8 more Caused by: java.nio.file.FileAlreadyExistsException: Object gs://example_bucket/flink/checkpoints//chk-5/_metadata already exists. at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82) ... 19 more 2019-04-30 22:20:03,114 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 6 @ 1556662802928 for job .{code} My guess at why; concurrent checkpoint writers are updating the _metadata resource concurrently. 
They should be using optimistic concurrency control with ETag on GCS, and then retry until successful. When running with parallelism=4, you always seem to get this, even after deleting all checkpoints: {code:java} [analytics-job-cluster-668886c96b-mhqmc job] 2019-04-30 22:50:35,175 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: Custom Source -> Process -> Timestamps/Watermarks -> our_events (1/4) of job is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.{code} Or in short: with parallelism > 1, your Flink job never makes progress. was: When running a standalone-job w/ parallelism=4 + taskmanager, you will shortly get this crash {code:java} 2019-04-30 22:20:02,928 WARN org.apache.flink.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 5. at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837) at
[jira] [Created] (FLINK-12379) Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint
Henrik created FLINK-12379: -- Summary: Parallelism in job/GCS/Hadoop: Could not finalize the pending checkpoint Key: FLINK-12379 URL: https://issues.apache.org/jira/browse/FLINK-12379 Project: Flink Issue Type: Bug Components: FileSystems Affects Versions: 1.8.0 Environment: GCS + {code:java} 1.8.0 1.8 2.11{code} {code:java} com.google.cloud.bigdataoss gcs-connector hadoop2-1.9.16 org.apache.flink flink-connector-filesystem_2.11 ${flink.version} org.apache.flink flink-hadoop-fs ${flink.version} org.apache.flink flink-shaded-hadoop2 ${hadoop.version}-${flink.version} {code} Reporter: Henrik When running a standalone-job w/ parallelism=4 + taskmanager, you will shortly get this crash {code:java} 2019-04-30 22:20:02,928 WARN org.apache.flink.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 5. at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:837) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756) at org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$9(JobMaster.java:676) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints//chk-5/_metadata' already exists at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.(GoogleHadoopOutputStream.java:74) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65) at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104) at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) ... 8 more Caused by: java.nio.file.FileAlreadyExistsException: Object gs://example_bucket/flink/checkpoints//chk-5/_metadata already exists. 
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1918) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:410) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:286) at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:264) at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:82) ... 19 more 2019-04-30 22:20:03,114 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 6 @ 1556662802928 for job .{code} My guess at why; concurrent checkpoint writers are updating the _metadata resource concurrently. They should be using optimistic concurrency control with ETag on GCS, and then retry until successful. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
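A minimal sketch of the "create only if absent, otherwise treat it as a concurrent finalization and back off" write the reporter is proposing, expressed against the google-cloud-storage client rather than the Hadoop connector actually used here; the class and the retry decision around it are illustrative assumptions, not Flink's code:

{code:java}
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

public final class CheckpointMetadataWriter {

    private final Storage storage = StorageOptions.getDefaultInstance().getService();

    /**
     * Attempts to create the _metadata object with an "object must not exist"
     * precondition. Returns false when another writer already finalized the
     * checkpoint, so the caller can skip or retry the whole finalization
     * instead of failing the job with FileAlreadyExistsException.
     */
    public boolean tryWriteMetadata(String bucket, String objectName, byte[] serializedMetadata) {
        BlobInfo blob = BlobInfo.newBuilder(BlobId.of(bucket, objectName)).build();
        try {
            storage.create(blob, serializedMetadata, Storage.BlobTargetOption.doesNotExist());
            return true;
        } catch (StorageException e) {
            if (e.getCode() == 412) { // precondition failed: another writer won the race
                return false;
            }
            throw e;
        }
    }
}
{code}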
[jira] [Updated] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit
[ https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12376: --- Environment: FROM flink:1.8.0-scala_2.11 ARG version=0.17 ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar /opt/flink/lib COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar was:* k8s latest docker-for-desktop on macOS, and scala 2.11-compiled Flink > GCS runtime exn: Request payload size exceeds the limit > --- > > Key: FLINK-12376 > URL: https://issues.apache.org/jira/browse/FLINK-12376 > Project: Flink > Issue Type: Bug > Components: FileSystems >Affects Versions: 1.7.2 > Environment: FROM flink:1.8.0-scala_2.11 > ARG version=0.17 > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib > COPY target/analytics-${version}.jar /opt/flink/lib/analytics.jar >Reporter: Henrik >Priority: Major > Attachments: Screenshot 2019-04-30 at 22.32.34.png > > > I'm trying to use the google cloud storage file system, but it would seem > that the FLINK / GCS client libs are creating too-large requests far down in > the GCS Java client. > The Java client is added to the lib folder with this command in Dockerfile > (probably > [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar] > at the time of writing): > > {code:java} > ADD > https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar > /opt/flink/lib{code} > This is the crash output. Focus lines: > {code:java} > java.lang.RuntimeException: Error while confirming checkpoint{code} > and > {code:java} > Caused by: com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes.{code} > Full stacktrace: > > {code:java} > [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO > org.apache.flink.runtime.taskmanager.Task - Source: > Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) > (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED. > [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error > while confirming checkpoint > [analytics-867c867ff6-l622h taskmanager] at > org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.FutureTask.run(FutureTask.java:266) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [analytics-867c867ff6-l622h taskmanager] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [analytics-867c867ff6-l622h taskmanager] at > java.lang.Thread.run(Thread.java:748) > [analytics-867c867ff6-l622h taskmanager] Caused by: > com.google.api.gax.rpc.InvalidArgumentException: > io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size > exceeds the limit: 524288 bytes. 
> [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) > [analytics-867c867ff6-l622h taskmanager] at > com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) > [analytics-867c867ff6-l622h taskmanager] at > com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748) > [analytics-867c867ff6-l622h taskmanager] at > io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507) > [analytics-867c867ff6-l622h
[jira] [Created] (FLINK-12377) Outdated docs (Flink 0.10) for Google Compute Engine
Henrik created FLINK-12377:
--
Summary: Outdated docs (Flink 0.10) for Google Compute Engine
Key: FLINK-12377
URL: https://issues.apache.org/jira/browse/FLINK-12377
Project: Flink
Issue Type: Bug
Components: Documentation
Affects Versions: 1.8.0
Reporter: Henrik

[https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/gce_setup.html] links to [https://github.com/GoogleCloudPlatform/bdutil/blob/master/extensions/flink/flink_env.sh], which uses ancient versions of Hadoop and Flink. Another barrier for a newcomer is that bdutil itself is deprecated and its readme recommends Dataflow instead.

Furthermore, perhaps it would be wise to include GCP among the built-in filesystems in [https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/filesystems.html?|https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/filesystems.html]

Further, [https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/filesystems.html#hdfs-and-hadoop-file-system-support] doesn't actually link to any of the configuration pages for these other Hadoop-based filesystems, nor does it explain what exact library needs to be in the Flink `lib` folder, so it's really hard to go any further from there.

Lastly, it would seem the 1.8.0 release has stopped shipping the Hadoop libs without documenting the change required, e.g. in the Hadoop File System page linked above.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
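Not part of the original report, but to make the documentation gap concrete: a minimal sketch of the kind of snippet the filesystem docs could show for using GCS with the 1.8.0 distribution. It simply mirrors the Dockerfile already used in FLINK-12376 below; the base image tag, connector jar URL, and paths are examples rather than an official recommendation, and the shaded Hadoop jar the 1.8.0 release no longer bundles still has to be supplied separately.

{code:java}
# Example only: start from the stock 1.8.0 image and drop the GCS connector into lib/
FROM flink:1.8.0-scala_2.11

# The gcs-connector jar provides the gs:// Hadoop filesystem implementation.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar /opt/flink/lib

# Since 1.8.0 the Flink distribution no longer bundles the shaded Hadoop libs,
# so a matching flink-shaded-hadoop2 jar also has to be placed in /opt/flink/lib
# (jar URL not shown here).
{code}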
[jira] [Updated] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit
[ https://issues.apache.org/jira/browse/FLINK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik updated FLINK-12376: --- Description: I'm trying to use the google cloud storage file system, but it would seem that the FLINK / GCS client libs are creating too-large requests far down in the GCS Java client. The Java client is added to the lib folder with this command in Dockerfile (probably [hadoop2-1.9.16|https://search.maven.org/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-1.9.16/jar] at the time of writing): {code:java} ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar /opt/flink/lib{code} This is the crash output. Focus lines: {code:java} java.lang.RuntimeException: Error while confirming checkpoint{code} and {code:java} Caused by: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds the limit: 524288 bytes.{code} Full stacktrace: {code:java} [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED. [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error while confirming checkpoint [analytics-867c867ff6-l622h taskmanager] at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [analytics-867c867ff6-l622h taskmanager] at java.lang.Thread.run(Thread.java:748) [analytics-867c867ff6-l622h taskmanager] Caused by: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds the limit: 524288 bytes. 
[analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) [analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) [analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) [analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) [analytics-867c867ff6-l622h taskmanager] at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748) [analytics-867c867ff6-l622h taskmanager] at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507) [analytics-867c867ff6-l622h taskmanager] at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482) [analytics-867c867ff6-l622h taskmanager] at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) [analytics-867c867ff6-l622h taskmanager] at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694) [analytics-867c867ff6-l622h taskmanager] at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
[jira] [Created] (FLINK-12376) GCS runtime exn: Request payload size exceeds the limit
Henrik created FLINK-12376: -- Summary: GCS runtime exn: Request payload size exceeds the limit Key: FLINK-12376 URL: https://issues.apache.org/jira/browse/FLINK-12376 Project: Flink Issue Type: Bug Components: FileSystems Affects Versions: 1.7.2 Environment: * k8s latest docker-for-desktop on macOS, and scala 2.11-compiled Flink Reporter: Henrik Attachments: Screenshot 2019-04-30 at 22.32.34.png I'm trying to use the google cloud storage file system, but it would seem that the FLINK / GCS client libs are creating too-large requests far down in the GCS Java client. {code:java} [analytics-867c867ff6-l622h taskmanager] 2019-04-30 20:23:14,532 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (9a01e96c0271025d5ba73b735847cd4c) switched from RUNNING to FAILED. [analytics-867c867ff6-l622h taskmanager] java.lang.RuntimeException: Error while confirming checkpoint [analytics-867c867ff6-l622h taskmanager] at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1211) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [analytics-867c867ff6-l622h taskmanager] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [analytics-867c867ff6-l622h taskmanager] at java.lang.Thread.run(Thread.java:748) [analytics-867c867ff6-l622h taskmanager] Caused by: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Request payload size exceeds the limit: 524288 bytes. 
[analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:49) [analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) [analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) [analytics-867c867ff6-l622h taskmanager] at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) [analytics-867c867ff6-l622h taskmanager] at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958) [analytics-867c867ff6-l622h taskmanager] at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748) [analytics-867c867ff6-l622h taskmanager] at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507) [analytics-867c867ff6-l622h taskmanager] at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482) [analytics-867c867ff6-l622h taskmanager] at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) [analytics-867c867ff6-l622h taskmanager] at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694) [analytics-867c867ff6-l622h taskmanager] at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) [analytics-867c867ff6-l622h taskmanager] at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) [analytics-867c867ff6-l622h taskmanager] at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) [analytics-867c867ff6-l622h taskmanager] at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) [analytics-867c867ff6-l622h taskmanager] at