环境:

flinksql 1.12.2

k8s session模式

描述:

当kafka 端口错误,过一段时间会有如下报错:

2021-04-25 16:49:50

org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired 
before the position for partition filebeat_json_install_log-3 could be 
determined

当kafka ip错误,过一段时间会有如下报错:

2021-04-25 20:12:53

org.apache.flink.runtime.JobException: Recovery is suppressed by 
NoRestartBackoffTimeStrategy

at 
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:118)

at 
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:80)

at 
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:233)

at 
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)

at 
org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)

at 
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:669)

at 
org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)

at 
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:447)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)

at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)

at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)

at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)

at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)

at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)

at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)

at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)

at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)

at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)

at akka.actor.Actor$class.aroundReceive(Actor.scala:517)

at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)

at akka.actor.ActorCell.invoke(ActorCell.scala:561)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)

at akka.dispatch.Mailbox.run(Mailbox.scala:225)

at akka.dispatch.Mailbox.exec(Mailbox.scala:235)

at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired 
while fetching topic metadata







然后对任务执行停止取消操作,会得到如下错误

2021-04-25 08:53:41,115 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job 
v2_ods_device_action_log (fcc451b8a521398b10e5b86153141fbf) switched from state 
CANCELLING to CANCELED.

2021-04-25 08:53:41,115 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping 
checkpoint coordinator for job fcc451b8a521398b10e5b86153141fbf.

2021-04-25 08:53:41,115 INFO  
org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - 
Shutting down

2021-04-25 08:53:41,115 INFO  
org.apache.flink.runtime.checkpoint.CompletedCheckpoint      [] - Checkpoint 
with ID 1 at 
'oss://tanwan-datahub/test/flinksql/checkpoints/fcc451b8a521398b10e5b86153141fbf/chk-1'
 not discarded.

2021-04-25 08:53:41,115 INFO  
org.apache.flink.runtime.checkpoint.CompletedCheckpoint      [] - Checkpoint 
with ID 2 at 
'oss://tanwan-datahub/test/flinksql/checkpoints/fcc451b8a521398b10e5b86153141fbf/chk-2'
 not discarded.

2021-04-25 08:53:41,116 INFO  
org.apache.flink.runtime.checkpoint.CompletedCheckpoint      [] - Checkpoint 
with ID 3 at 
'oss://tanwan-datahub/test/flinksql/checkpoints/fcc451b8a521398b10e5b86153141fbf/chk-3'
 not discarded.

2021-04-25 08:53:41,137 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 
fcc451b8a521398b10e5b86153141fbf reached globally terminal state CANCELED.

2021-04-25 08:53:41,148 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Stopping the JobMaster for job 
v2_ods_device_action_log(fcc451b8a521398b10e5b86153141fbf).

2021-04-25 08:53:41,151 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Suspending 
SlotPool.

2021-04-25 08:53:41,151 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Close ResourceManager connection 
5bdeb8d0f65a90ecdfafd7f102050b19: JobManager is shutting down..

2021-04-25 08:53:41,151 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Stopping 
SlotPool.

2021-04-25 08:53:41,151 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Disconnect job manager 
00000000000000000000000000000...@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_3
 for job fcc451b8a521398b10e5b86153141fbf from the resource manager.

2021-04-25 08:53:41,178 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Could not 
archive completed job 
v2_ods_device_action_log(fcc451b8a521398b10e5b86153141fbf) to the history 
server.

java.util.concurrent.CompletionException: java.lang.ExceptionInInitializerError

at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
 ~[?:1.8.0_265]

at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
 [?:1.8.0_265]

at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
 [?:1.8.0_265]

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_265]

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_265]

at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]

Caused by: java.lang.ExceptionInInitializerError

at 
org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:55)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]

at 
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:50)
 ~[template-common-jar-0.0.1.jar:?]

at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
 ~[?:1.8.0_265]

... 3 more

Caused by: java.lang.IllegalStateException: zip file closed

at java.util.zip.ZipFile.ensureOpen(ZipFile.java:686) ~[?:1.8.0_265]

at java.util.zip.ZipFile.getEntry(ZipFile.java:315) ~[?:1.8.0_265]

at java.util.jar.JarFile.getEntry(JarFile.java:240) ~[?:1.8.0_265]

at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128) 
~[?:1.8.0_265]

at java.util.jar.JarFile.getJarEntry(JarFile.java:223) ~[?:1.8.0_265]

at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1054) 
~[?:1.8.0_265]

at sun.misc.URLClassPath.getResource(URLClassPath.java:249) ~[?:1.8.0_265]

at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[?:1.8.0_265]

at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_265]

at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_265]

at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_265]

at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_265]

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) ~[?:1.8.0_265]

at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_265]

at java.lang.Class.forName0(Native Method) ~[?:1.8.0_265]

at java.lang.Class.forName(Class.java:264) ~[?:1.8.0_265]

at org.apache.logging.log4j.util.LoaderUtil.loadClass(LoaderUtil.java:168) 
~[log4j-api-2.12.1.jar:2.12.1]

at org.apache.logging.slf4j.Log4jLogger.createConverter(Log4jLogger.java:416) 
~[log4j-slf4j-impl-2.12.1.jar:2.12.1]

at org.apache.logging.slf4j.Log4jLogger.<init>(Log4jLogger.java:54) 
~[log4j-slf4j-impl-2.12.1.jar:2.12.1]

at 
org.apache.logging.slf4j.Log4jLoggerFactory.newLogger(Log4jLoggerFactory.java:39)
 ~[log4j-slf4j-impl-2.12.1.jar:2.12.1]

at 
org.apache.logging.slf4j.Log4jLoggerFactory.newLogger(Log4jLoggerFactory.java:30)
 ~[log4j-slf4j-impl-2.12.1.jar:2.12.1]

at 
org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:54)
 ~[log4j-api-2.12.1.jar:2.12.1]

at 
org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:30)
 ~[log4j-slf4j-impl-2.12.1.jar:2.12.1]

at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:277) 
~[template-common-jar-0.0.1.jar:?]

at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:288) 
~[template-common-jar-0.0.1.jar:?]

at 
org.apache.flink.runtime.history.FsJobArchivist.<clinit>(FsJobArchivist.java:50)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]

at 
org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:55)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]

at 
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:50)
 ~[template-common-jar-0.0.1.jar:?]

at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
 ~[?:1.8.0_265]

... 3 more




以后启动任何任务时都出现下面的错误

2021-04-25 08:54:06,711 INFO  org.apache.flink.client.ClientUtils               
           [] - Starting program (detached: true)

2021-04-25 08:54:06,715 INFO  
org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using 
predefined options: DEFAULT.

2021-04-25 08:54:06,715 INFO  
org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using default 
options factory: DefaultConfigurableOptionsFactory{configuredOptions={}}.

2021-04-25 08:54:06,722 ERROR 
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler   [] - Exception 
occurred in REST handler: Could not execute application.




必须得重启整个flink集群才能正常发布任务

回复